Comment by imtringued 12 hours ago
The bitter-lesson strategy would be to put heterogeneous experts inside an MoE architecture, so that the model itself chooses how many parameters are active by routing each token to a larger or smaller expert.
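To make the idea concrete, here is a toy sketch in plain Python of top-1 routing over experts of different widths. Everything in it (the names `make_expert`, `route_top1`, `moe_forward`, the sizes, the random untrained weights) is made up for illustration and is not taken from the paper or from any real MoE implementation.

```python
import random

def make_expert(d_model, d_hidden, rng):
    """Hypothetical two-layer MLP expert; d_hidden sets its parameter count."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return {"w_in": w_in, "w_out": w_out, "params": 2 * d_model * d_hidden}

def matvec(x, w):
    """Multiply a length-n vector by an n x m matrix stored as nested lists."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def expert_forward(expert, x):
    """Linear -> ReLU -> linear."""
    hidden = [max(0.0, v) for v in matvec(x, expert["w_in"])]
    return matvec(hidden, expert["w_out"])

def route_top1(x, router_w):
    """Return the index of the expert with the highest router logit."""
    logits = matvec(x, router_w)
    return max(range(len(logits)), key=logits.__getitem__)

def moe_forward(tokens, experts, router_w):
    """Route each token to one expert and record how many parameters it used."""
    outputs, active_params = [], []
    for x in tokens:
        k = route_top1(x, router_w)
        outputs.append(expert_forward(experts[k], x))
        active_params.append(experts[k]["params"])
    return outputs, active_params

# Assumed toy sizes: three experts that differ only in hidden width.
rng = random.Random(0)
D_MODEL = 8
HIDDEN_SIZES = [4, 16, 64]
experts = [make_expert(D_MODEL, h, rng) for h in HIDDEN_SIZES]
router_w = [[rng.gauss(0, 1) for _ in HIDDEN_SIZES] for _ in range(D_MODEL)]
tokens = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(5)]

outputs, active_params = moe_forward(tokens, experts, router_w)
print(active_params)  # active parameter count per token
```

The router here is random, so it says nothing about which tokens deserve a big expert. The point is only that the active parameter count varies per token inside one forward pass, with no second model involved.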
This approach is much more efficient than the one in the paper this HN submission links to, because request-based routing forces you to recompute the KV cache from scratch every time you switch from one model to another.
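A rough way to see the KV cache cost is to count prefilled tokens over a multi-turn conversation. The sketch below assumes that only the current model's cache is kept and that it is thrown away on every switch; the function name `prefill_tokens` and the context lengths are hypothetical, not measurements.

```python
def prefill_tokens(context_lengths, model_per_turn):
    """Count tokens that must be prefilled across a conversation.

    context_lengths[i] is the total context length at turn i.
    Assumption: staying on the same model reuses its KV cache, so only the
    new tokens are processed; switching models recomputes the whole context.
    """
    total = 0
    prev_model, prev_len = None, 0
    for length, model in zip(context_lengths, model_per_turn):
        if model == prev_model:
            total += length - prev_len  # cache hit: only the new suffix
        else:
            total += length             # cache miss: full recompute
        prev_model, prev_len = model, length
    return total

# Made-up conversation whose context grows over four turns.
context_lengths = [100, 250, 400, 600]
single_model = prefill_tokens(context_lengths, ["a", "a", "a", "a"])
alternating = prefill_tokens(context_lengths, ["a", "b", "a", "b"])
print(single_model, alternating)  # 600 1350
```

With one model the prefill work equals the final context length; with a switch on every turn it grows with the sum of all context lengths. A per-token router inside a single MoE model keeps one cache, so it stays in the first regime.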