hasperdi 2 days ago

and it can be faster if you can get an MoE version of that

  • dormento 2 days ago

    "Mixture-of-experts", AKA "running several small models and activating only a few at a time". Thanks for introducing me to that concept. Fascinating.

    (commentary: things are really moving too fast for the layperson to keep up)

    • hasperdi 2 days ago

      As pointed out by a sibling comment, MoE consists of a router and a number of experts (e.g. 8). The experts can be imagined as specialized parts of the brain, although in reality they probably don't work exactly like that. They aren't separate models; they are components of a single large model.

      Typically, each input gets routed to only a few experts, e.g. the top 2, leaving the others inactive (rough sketch below). This reduces the amount of activation / processing required.

      Mixtral (from Mistral) is an example of a model that's designed like this. Clever people have also created converters to transform dense models into MoE models. These days many popular models are available in an MoE configuration as well.
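
      Very rough sketch of what a top-2 MoE layer does, as a generic toy version in PyTorch (not any particular model's actual code; the sizes and the class name are made up for illustration):

          import torch
          import torch.nn as nn
          import torch.nn.functional as F

          class Top2MoE(nn.Module):
              def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
                  super().__init__()
                  self.top_k = top_k
                  # router: scores every token against every expert
                  self.router = nn.Linear(d_model, n_experts)
                  # experts: ordinary feed-forward blocks living inside the same model
                  self.experts = nn.ModuleList(
                      nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                      for _ in range(n_experts)
                  )

              def forward(self, x):                               # x: (n_tokens, d_model)
                  weights, idx = self.router(x).topk(self.top_k, dim=-1)
                  weights = F.softmax(weights, dim=-1)            # mixing weights for the chosen experts
                  out = torch.zeros_like(x)
                  for k in range(self.top_k):                     # only the selected experts run;
                      for e, expert in enumerate(self.experts):   # the others stay idle for this token
                          hit = idx[:, k] == e
                          if hit.any():
                              out[hit] += weights[hit, k, None] * expert(x[hit])
                  return out

          layer = Top2MoE()
          print(layer(torch.randn(4, 512)).shape)                 # torch.Size([4, 512])

      The loops are written for clarity; real implementations batch tokens per expert, but the routing idea is the same.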

    • whimsicalism 2 days ago

      that's not really a good summary of what MoEs are. you can think of it more like sublayers that get routed through (like how the brain only lights up certain pathways) rather than actual separate models.

      • Mehvix 2 days ago

        The gain from MoE is that you can have a large model that's still efficient: it lets you decouple parameter count from computation cost. I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'. These are totally different systems; IMO the shallow comparison muddies the water and does a disservice to both fields of study. There's been plenty of research showing there's redundancy in MoE models, e.g. Cerebras has a paper[1] where they selectively prune half the experts with minimal loss across domains -- I'm not sure you could disable half the brain without a stupefying difference.

        [1] https://www.cerebras.ai/blog/reap
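
        To put the decoupling in rough, made-up numbers (no particular model):

            # 8 experts with top-2 routing; all sizes here are assumptions for illustration
            n_experts, top_k = 8, 2
            expert_params = 5e9        # parameters per expert (assumed)
            shared_params = 5e9        # attention, embeddings, etc. touched by every token (assumed)

            total_params  = shared_params + n_experts * expert_params   # what must sit in memory
            active_params = shared_params + top_k * expert_params       # what each token actually computes with

            print(f"total:  {total_params/1e9:.0f}B params stored")      # 45B
            print(f"active: {active_params/1e9:.0f}B params per token")  # 15B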

        • whimsicalism 17 hours ago

          > I don't see how anthropomorphizing MoE <-> brain affords insight deeper than 'less activity means less energy used'.

          I'm not saying it is a perfect analogy, but it is by far the most familiar one for people to describe what sparse activation means. I'm no big fan of over-reliance on biological metaphor in this field, but I think this is skewing a bit on the pedantic side.

          re: your second comment about pruning, not to get into the weeds, but I think there have been a few notable cases where people did lose part of their brain and the rest essentially routed around it.

  • miohtama 2 days ago

    All modern models are MoE already, no?

    • hasperdi a day ago

      That's not the case. Some are dense and some are hybrid.

      MoE is not the holy grail though, as there are drawbacks, e.g. less consistent outputs and expert under-/over-use (the load-balancing problem).
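
      To make the under-/over-use point concrete, here is a rough sketch of an auxiliary load-balancing loss along the lines of the one in the Switch Transformer paper, which nudges the router toward using experts evenly (a generic illustration, not code from any particular library):

          import torch
          import torch.nn.functional as F

          def load_balancing_loss(router_logits, top1_idx, n_experts):
              # f[e]: fraction of tokens whose top-1 choice was expert e
              f = torch.bincount(top1_idx, minlength=n_experts).float() / top1_idx.numel()
              # p[e]: mean router probability assigned to expert e across all tokens
              p = F.softmax(router_logits, dim=-1).mean(dim=0)
              # minimized when both are uniform, i.e. every expert gets an equal share
              return n_experts * torch.sum(f * p)

          logits = torch.randn(1024, 8)                  # router scores for 1024 tokens, 8 experts
          aux = load_balancing_loss(logits, logits.argmax(dim=-1), 8)
          print(aux)                                     # ~1.0 when balanced, larger when skewed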

  • bigyabai 2 days ago

    >90% of inference hardware is faster if you run an MOE model.
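
    Presumably because single-stream decoding is usually memory-bandwidth bound, so tokens/sec roughly tracks how many weight bytes have to be read per token, and an MoE only touches its active parameters. Back-of-the-envelope with purely illustrative numbers:

        # Assumptions: ~1 TB/s memory bandwidth, 8-bit weights, bandwidth-bound decoding;
        # comparing a 45B dense model against an MoE with 45B total / 15B active params.
        bandwidth = 1e12                      # bytes/s
        bytes_per_param = 1                   # int8 weights (assumed)

        for name, active_params in [("dense", 45e9), ("moe", 15e9)]:
            tok_per_s = bandwidth / (active_params * bytes_per_param)
            print(f"{name}: ~{tok_per_s:.0f} tokens/s")   # dense ~22, moe ~67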