Comment by yorwba 3 days ago

The model is explicitly trained to produce as uniform a routing distribution as possible because it's designed for batched inference with a batch size much larger than the expert count. In that regime all experts are constantly active and latency is determined by the most heavily loaded expert, so you want to spread the load evenly to maximize utilization.
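For concreteness, here's a rough sketch of the kind of auxiliary load-balancing loss (Switch-Transformer style, not necessarily what this particular model uses) that pushes the router toward uniform expert usage:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top_k_indices, num_experts):
    """Auxiliary loss that is minimized when the router both assigns
    probability uniformly and dispatches tokens uniformly across experts,
    so no single expert becomes the latency bottleneck in large batches."""
    # router_logits: (num_tokens, num_experts), top_k_indices: (num_tokens, k)
    probs = F.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                      # avg router probability per expert
    # fraction of tokens actually dispatched to each expert
    dispatch = F.one_hot(top_k_indices, num_experts).float().sum(dim=1)
    mean_dispatch = dispatch.mean(dim=0)               # (num_experts,)
    # product term is smallest when both distributions are flat
    return num_experts * torch.sum(mean_prob * mean_dispatch)
```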

Prompt ingestion is still fairly similar to that setting: you can first compute the expert routing for all tokens, load the first expert's weights and process only the tokens that selected it, then load the second expert, and so on.
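A minimal sketch of that expert-by-expert prefill loop (PyTorch-flavored; `load_expert` is a hypothetical stand-in for however you page expert weights in from disk or host RAM):

```python
import torch

def prefill_expert_by_expert(hidden, router, load_expert, top_k=2):
    """Run the prompt with only one expert's weights resident at a time:
    route every token first, then stream experts through memory and run
    each one over just the tokens that selected it."""
    # hidden: (num_tokens, d_model); router: nn.Linear(d_model, num_experts)
    logits = router(hidden)
    weights, indices = torch.topk(torch.softmax(logits, dim=-1), top_k, dim=-1)
    out = torch.zeros_like(hidden)
    num_experts = logits.shape[-1]
    for e in range(num_experts):
        # which tokens (and which top-k slot) picked expert e
        token_idx, slot = (indices == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        expert = load_expert(e)   # hypothetical: load this expert's weights
        out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(hidden[token_idx])
        del expert                # free the weights before loading the next one
    return out
```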

But if you want to optimize for single-stream token generation, you need a completely different model design. For example, PowerInfer's SmallThinker moved expert routing to an earlier layer, so that the expert weights can be prefetched asynchronously while another layer is still executing: https://arxiv.org/abs/2507.20984
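A toy sketch of that prefetching idea (loosely following the paper's description, not their actual implementation; `fetch_weights` and `run_expert` are hypothetical stand-ins for the real weight-loading and compute paths):

```python
import threading

class PrefetchingMoELayer:
    """Routing for this layer is decided by an earlier layer, so the chosen
    experts' weights can be fetched in the background while that earlier
    layer is still computing."""

    def __init__(self, fetch_weights, run_expert):
        self.fetch_weights = fetch_weights
        self.run_expert = run_expert
        self._pending = None

    def prefetch(self, expert_ids):
        # Called as soon as the earlier layer has emitted routing decisions.
        result = {}
        def worker():
            for e in expert_ids:
                result[e] = self.fetch_weights(e)   # e.g. read from SSD / host RAM
        t = threading.Thread(target=worker)
        t.start()
        self._pending = (t, result)

    def forward(self, hidden, expert_ids, gate_weights):
        # By the time execution reaches this layer, the weights are
        # (hopefully) already resident; join just in case the fetch is slow.
        t, weights = self._pending
        t.join()
        out = 0.0
        for e, g in zip(expert_ids, gate_weights):
            out = out + g * self.run_expert(weights[e], hidden)
        return out
```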