Comment by timschmidt 4 days ago
Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts may need to be swapped in and out of VRAM for each token, so RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bandwidth bottleneck gets pushed down to the SSD interface. At least the traffic is read-only rather than read-write, but even the fastest SSDs are significantly slower than RAM.
That said, there are folks out there doing it. https://github.com/lyogavin/airllm is one example.
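For a feel of what per-token swapping costs, here's a minimal PyTorch sketch (assuming a CUDA GPU is available; the ~0.35 GB expert size is my rough guess at a Mixtral-8x7B-style FFN expert, not a measured number) that just times copying one expert-sized blob from pinned host RAM into a preallocated VRAM buffer:

    # Time copying one expert-sized weight blob from pinned host RAM into a
    # preallocated VRAM buffer.  The size is a rough guess at a Mixtral-8x7B
    # FFN expert (three fp16 matrices of 4096 x 14336), purely illustrative.
    import time
    import torch

    EXPERT_BYTES = 3 * 4096 * 14336 * 2      # ~0.35 GB in fp16
    n_elems = EXPERT_BYTES // 2              # number of fp16 elements

    cpu_expert = torch.empty(n_elems, dtype=torch.float16, pin_memory=True)
    gpu_buffer = torch.empty(n_elems, dtype=torch.float16, device="cuda")

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(10):
        gpu_buffer.copy_(cpu_expert, non_blocking=True)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / 10

    gb = EXPERT_BYTES / 1e9
    print(f"{gb:.2f} GB per expert, {dt * 1e3:.1f} ms per copy, {gb / dt:.1f} GB/s")

On a PCIe 4.0 x16 link (~25 GB/s in practice) that works out to roughly 14 ms per expert, so streaming two experts for each of ~32 layers every token is on the order of a second of pure transfer time per token, and an SSD at a few GB/s would be several times worse again.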
> Experts can be swapped in and out of VRAM for each token.
I've often wondered how much this happens in practice. What does the per-token distribution of expert selection actually look like during inference? For example, does it behave like a uniform random variable, or does it stick with the same 2 or 3 experts for 10 tokens in a row? I haven't been able to find much info on this.
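One way to peek at it yourself, assuming a Mixtral-style model served through Hugging Face transformers (which can return the per-layer router logits via output_router_logits=True): run a forward pass over some text and measure how often consecutive tokens reuse at least one expert. A rough sketch, with the checkpoint name purely as an example (the full model needs a lot of memory, so substitute whatever fits):

    # Probe per-token expert routing for a Mixtral-style MoE using transformers'
    # output_router_logits flag.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "mistralai/Mixtral-8x7B-v0.1"   # example only; use any Mixtral-family checkpoint
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16, device_map="auto"
    )

    text = "The quick brown fox jumps over the lazy dog. " * 20
    inputs = tok(text, return_tensors="pt").to(model.device)

    with torch.no_grad():
        out = model(**inputs, output_router_logits=True)

    top_k = model.config.num_experts_per_tok   # 2 for Mixtral
    # out.router_logits: one (seq_len, num_experts) tensor per MoE layer
    for layer_idx, logits in enumerate(out.router_logits):
        chosen = logits.topk(top_k, dim=-1).indices          # (seq_len, top_k)
        prev, cur = chosen[:-1], chosen[1:]
        # fraction of consecutive token pairs that share at least one expert
        reuse = (prev.unsqueeze(-1) == cur.unsqueeze(-2)).any(-1).any(-1)
        print(f"layer {layer_idx:2d}: consecutive-token expert reuse = "
              f"{reuse.float().mean().item():.2f}")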
Obviously it depends on which model you're talking about, so some kind of survey would be interesting. I'm sure this must be something the big inference labs know a lot about.
Although, I guess if you're batching requests, then even if a single query only activates a subset of experts, expert selection across the whole batch may look essentially random, which would destroy any efficiency gains. Perhaps it's possible to intelligently batch queries that are "similar" somehow? It's quite an interesting research problem when you think about it.
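A quick back-of-envelope supports that worry: if you assume routing is uniformly random and independent across the tokens in a batch (pessimistic, but a useful bound), the expected number of distinct experts a layer touches per step is E * (1 - (1 - k/E)^B):

    # Expected number of distinct experts a single MoE layer touches per step,
    # assuming uniformly random, independent top-k routing across the batch.
    # Numbers are Mixtral-like (8 experts, top-2), purely illustrative.
    num_experts, top_k = 8, 2

    for batch in (1, 2, 4, 8, 16, 32, 64):
        p_idle = (1 - top_k / num_experts) ** batch    # P(an expert sees no token)
        expected_active = num_experts * (1 - p_idle)
        print(f"batch {batch:3d}: ~{expected_active:.1f} of {num_experts} experts active")

Under that assumption a batch of 8 already touches about 7 of the 8 experts in every layer, so the "only load a few experts" saving mostly evaporates unless routing turns out to be strongly correlated across requests.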
Come to think of it, how does this work for the "prompt ingestion" stage, where all the prompt tokens are routed in one pass and likely end up touching every expert while building the KV cache? I guess that would erase the MoE efficiency gains there too, so the prompt ingestion and autoregressive generation stages will have quite different execution profiles.