Comment by estimator7292 a day ago

If there are any LLM frameworks that can shard over disparate processor architectures, I haven't heard of them.

It'd be pretty cool for sure, but I'd expect you'd be absolutely strangled by memory bandwidth. I'm sure the chipset would not at all enjoy trying to route all that RAM traffic to three processors at once.
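
To put a rough number on that point, a quick back-of-envelope in Python (the model size and bandwidth figures here are assumptions, not measurements):

```python
# Back-of-envelope: when decoding is bandwidth-bound, every weight is read
# roughly once per token, so tokens/sec ~= memory bandwidth / model bytes.
# All numbers below are assumptions.

model_bytes = 7e9 * 2   # assumed: 7B params at fp16 (2 bytes each)
shared_bw = 100e9       # assumed: ~100 GB/s of shared DRAM bandwidth

print(f"~{shared_bw / model_bytes:.1f} tokens/sec with all bandwidth on one device")

# Three processors contending for the same DRAM don't get 3x the bandwidth;
# they split (and fragment) it, so naive heterogeneous sharding can be slower.
```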

nickpsecurity 21 hours ago

No doubt. I had a few ideas for what might be done:

1. Put the tokenizers or other lower-performance parts on the NPU (first sketch after this list).

2. Pipelining that moves data through different models or layers on different hardware (second sketch below).

3. If the model has many layers, put most of them on the fastest processor and a small number on the others. As with hardware clock ratios, the split is chosen so the slower parts don't drag down overall throughput (third sketch below).
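
For idea 1, a minimal sketch of overlapping tokenization with generation so the low-compute part can live on its own processor. tokenize() and generate() here are hypothetical stand-ins, not any real framework's API:

```python
import queue
import threading

def tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]          # stand-in; imagine this on the NPU

def generate(ids: list[int]) -> str:
    return f"echo of {len(ids)} tokens"    # stand-in; imagine this on the GPU

def tokenizer_worker(texts, out_q):
    for t in texts:
        out_q.put(tokenize(t))
    out_q.put(None)                        # sentinel: no more work

texts = ["hello", "heterogeneous", "sharding"]
q = queue.Queue(maxsize=8)
threading.Thread(target=tokenizer_worker, args=(texts, q), daemon=True).start()
while (ids := q.get()) is not None:
    print(generate(ids))   # overlaps with tokenizing the next text
```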
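For idea 2, a toy pipeline split in PyTorch. Both stages sit on "cpu" here so the snippet runs anywhere; on real hardware they'd be different backends (e.g. "cuda" and "mps"), which is an assumption on my part:

```python
import torch
import torch.nn as nn

# Each stage lives on its own device; only activations cross the boundary.
dev_a, dev_b = torch.device("cpu"), torch.device("cpu")

stage1 = nn.Sequential(
    nn.Embedding(1000, 64),
    nn.TransformerEncoderLayer(64, 4, batch_first=True),
).to(dev_a)
stage2 = nn.Sequential(
    nn.TransformerEncoderLayer(64, 4, batch_first=True),
    nn.Linear(64, 1000),
).to(dev_b)

tokens = torch.randint(0, 1000, (1, 16))
hidden = stage1(tokens.to(dev_a))
logits = stage2(hidden.to(dev_b))   # activations move, weights stay put
print(logits.shape)                 # torch.Size([1, 16, 1000])
```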
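For idea 3, the ratio math itself is simple; the speed ratio would have to be measured on the actual hardware:

```python
# Pick the layer split so both pipeline stages take about the same time per
# token: stage times balance when layers_a / layers_b == speed_a / speed_b.

def split_layers(n_layers: int, speed_a: float, speed_b: float) -> tuple[int, int]:
    on_a = round(n_layers * speed_a / (speed_a + speed_b))
    return on_a, n_layers - on_a

# e.g. GPU ~8x faster than NPU, 32 layers total:
# -> (28, 4): stage A ~3.5 layer-time units/token, stage B 4, roughly balanced.
print(split_layers(32, speed_a=8.0, speed_b=1.0))
```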

In things like game or real-time AIs, especially multimodal ones, there's even more potential, since the different parts could run on different chips (last sketch below).
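
A rough sketch of that, with stand-in encoders (the Linear layers and the device placement are placeholders, not a real multimodal model):

```python
import concurrent.futures as cf
import torch
import torch.nn as nn

# Each modality's encoder could sit on its own chip and run in parallel.
vision = nn.Linear(768, 512)   # stand-in vision encoder (imagine: NPU)
audio = nn.Linear(128, 512)    # stand-in audio encoder (imagine: CPU)

img, wav = torch.randn(1, 768), torch.randn(1, 128)
with cf.ThreadPoolExecutor() as pool:
    v = pool.submit(vision, img)
    a = pool.submit(audio, wav)
    fused = torch.cat([v.result(), a.result()], dim=-1)  # hand off to the LLM
print(fused.shape)   # torch.Size([1, 1024])
```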