Comment by Mehvix
The gain from MoE is that you can have a large model that's still efficient: it lets you decouple parameter count from computation cost. I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'. These are totally different systems; IMO this shallow comparison muddies the water and does a disservice to each field of study. There's been loads of research showing there's redundancy in MoE models, e.g. Cerebras has a paper[1] where they selectively prune half the experts with minimal loss across domains. I'm not sure you could disable half the brain without noticing a stupefying difference.
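To make the "decouple parameter count from computation cost" point concrete, here's a rough back-of-the-envelope sketch. It assumes a plain top-k routed MoE where every expert is the same size; the function name `moe_param_counts` and all the numbers are made up for illustration and don't come from any particular model or paper.

```python
def moe_param_counts(num_experts, top_k, params_per_expert, shared_params=0):
    """Return (total, active) parameter counts for a simplified top-k MoE.

    Assumption: all experts are equally sized and exactly top_k of them
    run per token. shared_params covers everything that always runs
    (attention, embeddings, router), lumped together for simplicity.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Illustrative numbers only: 8 experts, 2 routed per token.
total, active = moe_param_counts(
    num_experts=8, top_k=2, params_per_expert=1_000_000, shared_params=500_000
)
print(f"total={total:,} active per token={active:,}")
```

Adding experts grows `total` but leaves `active` untouched, which is the whole trick: per-token compute tracks `top_k`, not `num_experts`.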
> I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'.
I'm not saying it is a perfect analogy, but it is by far the most familiar way to convey to people what sparse activation means. I'm no big fan of over-reliance on biological metaphor in this field, but I think this objection is skewing a bit pedantic.
Re: your second point about pruning: not to get into the weeds, but I think there have been a few unusual cases where people lost part of their brain and the rest of it essentially routed around the damage.