Comment by viraptor 2 days ago

How well separated are the experts per domain in a model like that? Specifically, if I'm only interested in programming use, could we strip it down to one or two of them? Or should I assume a much wider spread? (And there would be some overlap anyway from the original root model.)

renonce 2 days ago

My experience is that experts are not separated in any intuitive way. I would be very interested (and surprised) if someone manages to prune a majority of experts in a way that preserves model capabilities in a specific domain but not others.

See https://github.com/peteryuqin/Kimi-K2-Mini, a project that keeps a small portion of the experts and layers while keeping the model's capabilities across multiple domains.

  • viraptor 2 days ago

    Sounds like dumping the routing information from programming questions would answer that (roughly the kind of thing sketched below)... I guess I can do a dump from qwen or deepseek locally. You'd think someone would have created that kind of graph already, but I couldn't find one.

    What I did find instead is that some MoE models are explicitly domain-routed (MoDEM), but that doesn't apply to deepseek, which is just evenly load balanced, so it's unlikely to apply to Kimi. On the other hand, https://arxiv.org/html/2505.21079v1 shows modality preferences between experts, even with mostly random training. So maybe there's something there.
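    To actually get that data, here's a rough sketch of the dump (untested, assuming a Mixtral-style MoE in Hugging Face transformers, which exposes per-layer router logits via output_router_logits=True; the model id and prompts are just placeholders):

    ```python
    # Count how often each (layer, expert) pair lands in the top-k on programming prompts.
    import torch
    from collections import Counter
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mixtral-8x7B-v0.1"  # stand-in; swap for whatever MoE you run locally
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    prompts = ["def quicksort(arr):", "SELECT name FROM users WHERE"]  # programming-flavoured probes
    counts = Counter()  # (layer, expert) -> number of times selected

    for p in prompts:
        inputs = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_router_logits=True)
        # out.router_logits: one (num_tokens, num_experts) tensor per MoE layer
        for layer_idx, logits in enumerate(out.router_logits):
            top_k = torch.topk(logits, k=model.config.num_experts_per_tok, dim=-1).indices
            for expert in top_k.flatten().tolist():
                counts[(layer_idx, expert)] += 1

    for (layer, expert), n in counts.most_common(20):
        print(f"layer {layer:2d}  expert {expert:2d}  selected {n} times")
    ```

    Comparing those counts against a dump from non-programming prompts would show how concentrated the "programming" experts actually are.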

orbital-decay 2 days ago

Inseparable: routing is done per token in a statistically optimal way, not per request on a knowledge-domain basis.
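For illustration, per-token top-k routing in a Mixtral-style MoE layer looks roughly like this (simplified sketch; the expert choice comes from each token's hidden state alone, so nothing about the request or its knowledge domain enters the mechanism):

```python
import torch

def route_tokens(hidden, router_weight, k=2):
    # hidden: (num_tokens, d_model); router_weight: (num_experts, d_model)
    scores = hidden @ router_weight.T                    # (num_tokens, num_experts)
    topk_scores, topk_experts = torch.topk(scores, k, dim=-1)
    gates = torch.softmax(topk_scores, dim=-1)           # mixing weights for the chosen experts
    return topk_experts, gates                           # a separate decision for every token

tokens = torch.randn(5, 64)   # hidden states of 5 tokens in a toy 64-dim model
router = torch.randn(8, 64)   # router matrix for 8 experts
experts, gates = route_tokens(tokens, router)
print(experts)                # each row (token) gets its own expert set, whatever the prompt's domain
```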

  • viraptor 2 days ago

    Sure, it's done per token, but the question is: how much do the knowledge domains match up with experts? I could not find hard data on this.

    • boroboro4 2 days ago

      Check out the DeepSeek-V3 model paper. They changed the way they train experts (went from an aux loss to a different kind of expert-separation training). It did improve the experts' domain specialization; they have neat graphics on it in the paper.
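      Roughly, the auxiliary-loss-free balancing they describe adds a per-expert bias that only affects which experts get selected, not their gate weights, and nudges it against the observed load after each step. A simplified sketch of the idea (not the paper's exact formulation; gamma is an illustrative value):

      ```python
      import torch

      num_experts, k, gamma = 8, 2, 0.001   # gamma: bias update speed
      bias = torch.zeros(num_experts)       # b_i: per-expert selection bias

      def select_experts(logits):
          # logits: (num_tokens, num_experts) raw router scores
          affinity = torch.sigmoid(logits)                         # V3-style affinities
          topk = torch.topk(affinity + bias, k, dim=-1).indices    # bias steers selection...
          gates = torch.gather(affinity, -1, topk)                 # ...but gate values ignore it
          return topk, gates / gates.sum(-1, keepdim=True)

      def update_bias(topk):
          # After a step: push overloaded experts down, underloaded experts up.
          load = torch.bincount(topk.flatten(), minlength=num_experts).float()
          bias.add_(gamma * torch.sign(load.mean() - load))
      ```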
