Comment by imtringued

The bitter-lesson strategy would be to put heterogeneous experts inside an MoE architecture, so that the model learns to choose its own number of active parameters by routing to larger or smaller experts.
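
To make the idea concrete, here is a minimal sketch of top-1 routing over experts of different sizes. Everything in it is an assumption for illustration: the names (make_expert, route, moe_forward), the hidden sizes, and the random untrained weights are hypothetical and not taken from any paper or library. It only shows that the active parameter count can vary per token depending on which expert the router picks.

```python
import math
import random

def make_expert(d_model, d_hidden, rng):
    """A two-layer ReLU MLP expert; d_hidden sets its parameter count."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    return {"w_in": w_in, "w_out": w_out, "params": 2 * d_model * d_hidden}

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def run_expert(expert, x):
    hidden = [max(0.0, v) for v in matvec(expert["w_in"], x)]
    return matvec(expert["w_out"], hidden)

def route(router_w, x):
    """Top-1 routing: return (chosen expert index, softmax probabilities)."""
    logits = matvec(router_w, x)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return max(range(len(probs)), key=probs.__getitem__), probs

def moe_forward(experts, router_w, tokens):
    """Send each token to one expert and record how many parameters it used."""
    outputs, active_params = [], []
    for x in tokens:
        idx, probs = route(router_w, x)
        y = run_expert(experts[idx], x)
        # Scale by the gate probability, as is common in MoE layers.
        outputs.append([probs[idx] * v for v in y])
        active_params.append(experts[idx]["params"])
    return outputs, active_params

# Hypothetical setup: three experts of very different sizes (assumed values).
rng = random.Random(0)
D_MODEL = 8
HIDDEN_SIZES = [4, 16, 64]
experts = [make_expert(D_MODEL, h, rng) for h in HIDDEN_SIZES]
router_w = [[rng.gauss(0, 1.0) for _ in range(D_MODEL)] for _ in experts]
tokens = [[rng.gauss(0, 1.0) for _ in range(D_MODEL)] for _ in range(5)]

outputs, active_params = moe_forward(experts, router_w, tokens)
```

In a trained model the router would presumably need some cost penalty so it does not send everything to the largest expert; that part is left out of this sketch.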

This would be much more efficient than the approach in the paper this HN submission links to, because request-based routing forces you to recompute the KV cache from scratch every time you switch from one model to another.
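
A rough way to see the cost is to count how many context tokens have to be (re)encoded to fill the KV cache. The sketch below is a toy model under stated assumptions, not a measurement: prefill_tokens is a hypothetical helper, each turn appends a fixed number of tokens to a shared context, a model that stays selected only encodes the new tokens, and a newly selected model has no usable cache and re-encodes the whole context.

```python
def prefill_tokens(turn_lengths, model_per_turn):
    """Count tokens that must be encoded to fill KV caches over a conversation.

    Assumption: switching models discards the cache, so the new model
    re-encodes the entire context; staying on the same model only encodes
    the tokens added in the current turn.
    """
    total = 0
    context = 0
    previous_model = None
    for new_tokens, model in zip(turn_lengths, model_per_turn):
        context += new_tokens
        if model != previous_model:
            total += context      # cache rebuilt from scratch
        else:
            total += new_tokens   # cache reused, only new tokens encoded
        previous_model = model
    return total

# Hypothetical conversation: four turns adding 100, 50, 50, 50 tokens.
turns = [100, 50, 50, 50]
single_model = prefill_tokens(turns, ["A", "A", "A", "A"])   # 250
alternating = prefill_tokens(turns, ["A", "B", "A", "B"])    # 700
```

With routing inside a single MoE model the cache is shared across experts, so the first case applies no matter which experts get picked.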