Comment by estimator7292 a day ago

If there are any LLM frameworks that can shard over disparate processor architectures, I haven't heard of them.

It'd be pretty cool for sure, but I'd expect you'd be absolutely strangled by memory bandwidth. I'm sure the chipset would not at all enjoy trying to route all that RAM traffic to three processors at once.
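
To put a rough number on that point, a quick back-of-envelope in Python (the model size and bandwidth figures here are assumptions, not measurements):

```python
# Back-of-envelope: when decoding is bandwidth-bound, every weight is read
# roughly once per token, so tokens/sec ~= memory bandwidth / model bytes.
# All numbers below are assumptions.

model_bytes = 7e9 * 2   # assumed: 7B params at fp16 (2 bytes each)
shared_bw = 100e9       # assumed: ~100 GB/s of shared DRAM bandwidth

print(f"~{shared_bw / model_bytes:.1f} tokens/sec with all bandwidth on one device")

# Three processors contending for the same DRAM don't get 3x the bandwidth;
# they split (and fragment) it, so naive heterogeneous sharding can be slower.
```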

nickpsecurity 21 hours ago

No doubt. I had a few ideas for what might be done:

1. Put the tokenizers or other lower-performance parts on the NPU (first sketch after this list).

2. Pipelining that moves data through different models or layers on different hardware (second sketch below).

3. If the model has many layers, put most of them on the fastest processor and a small number on the others. As with hardware clock ratios, the split is chosen so the slower parts don't drag down overall throughput (third sketch below).
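
For idea 1, a minimal sketch of overlapping tokenization with generation so the low-compute part can live on its own processor. tokenize() and generate() here are hypothetical stand-ins, not any real framework's API:

```python
import queue
import threading

def tokenize(text: str) -> list[int]:
    return [ord(c) for c in text]          # stand-in; imagine this on the NPU

def generate(ids: list[int]) -> str:
    return f"echo of {len(ids)} tokens"    # stand-in; imagine this on the GPU

def tokenizer_worker(texts, out_q):
    for t in texts:
        out_q.put(tokenize(t))
    out_q.put(None)                        # sentinel: no more work

texts = ["hello", "heterogeneous", "sharding"]
q = queue.Queue(maxsize=8)
threading.Thread(target=tokenizer_worker, args=(texts, q), daemon=True).start()
while (ids := q.get()) is not None:
    print(generate(ids))   # overlaps with tokenizing the next text
```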
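For idea 2, a toy pipeline split in PyTorch. Both stages sit on "cpu" here so the snippet runs anywhere; on real hardware they'd be different backends (e.g. "cuda" and "mps"), which is an assumption on my part:

```python
import torch
import torch.nn as nn

# Each stage lives on its own device; only activations cross the boundary.
dev_a, dev_b = torch.device("cpu"), torch.device("cpu")

stage1 = nn.Sequential(
    nn.Embedding(1000, 64),
    nn.TransformerEncoderLayer(64, 4, batch_first=True),
).to(dev_a)
stage2 = nn.Sequential(
    nn.TransformerEncoderLayer(64, 4, batch_first=True),
    nn.Linear(64, 1000),
).to(dev_b)

tokens = torch.randint(0, 1000, (1, 16))
hidden = stage1(tokens.to(dev_a))
logits = stage2(hidden.to(dev_b))   # activations move, weights stay put
print(logits.shape)                 # torch.Size([1, 16, 1000])
```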
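For idea 3, the ratio math itself is simple; the speed ratio would have to be measured on the actual hardware:

```python
# Pick the layer split so both pipeline stages take about the same time per
# token: stage times balance when layers_a / layers_b == speed_a / speed_b.

def split_layers(n_layers: int, speed_a: float, speed_b: float) -> tuple[int, int]:
    on_a = round(n_layers * speed_a / (speed_a + speed_b))
    return on_a, n_layers - on_a

# e.g. GPU ~8x faster than NPU, 32 layers total:
# -> (28, 4): stage A ~3.5 layer-time units/token, stage B 4, roughly balanced.
print(split_layers(32, speed_a=8.0, speed_b=1.0))
```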

In things like game or real-time AIs, especially multimodal ones, there's even more potential, since the different parts could run on different chips (last sketch below).
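
A rough sketch of that, with stand-in encoders (the Linear layers and the device placement are placeholders, not a real multimodal model):

```python
import concurrent.futures as cf
import torch
import torch.nn as nn

# Each modality's encoder could sit on its own chip and run in parallel.
vision = nn.Linear(768, 512)   # stand-in vision encoder (imagine: NPU)
audio = nn.Linear(128, 512)    # stand-in audio encoder (imagine: CPU)

img, wav = torch.randn(1, 768), torch.randn(1, 128)
with cf.ThreadPoolExecutor() as pool:
    v = pool.submit(vision, img)
    a = pool.submit(audio, wav)
    fused = torch.cat([v.result(), a.result()], dim=-1)  # hand off to the LLM
print(fused.shape)   # torch.Size([1, 1024])
```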