Comment by zackangelo a day ago

Their inference server is written in Rust using Hugging Face's Candle crate. One of the Moshi authors is also the primary author of Candle.

We’ve also been building our inference stack on top of Candle; I’m really happy with it.
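
If you haven’t looked at Candle yet, the tensor API is compact and feels familiar if you’ve used PyTorch. A toy sketch (not our server code, just a minimal candle_core example) looks roughly like this:

    use candle_core::{Device, Tensor};

    fn main() -> candle_core::Result<()> {
        // Toy "linear layer": random weights, random input, one matmul on the CPU.
        let device = Device::Cpu;
        let w = Tensor::randn(0f32, 1f32, (4, 8), &device)?;
        let x = Tensor::randn(0f32, 1f32, (1, 4), &device)?;
        let y = x.matmul(&w)?;
        println!("output shape: {:?}", y.shape());
        Ok(())
    }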

baggiponte a day ago

Super interested. Do you have an equivalent of vLLM? Did you have to rewrite batching, paged attention…?

  • zackangelo a day ago

    Yeah, I’ve had to rewrite continuous batching and other scheduling logic. That and multi-GPU inference have been the hardest things to build.

    I’ll need to get paged attention working as well, but I think I can launch without it.
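
    For anyone curious, the core idea of continuous batching is that requests join and leave the running batch between decode steps instead of waiting for the whole batch to finish. A rough sketch of the scheduler loop (hypothetical types, model call stubbed out, nothing Candle-specific):

        use std::collections::VecDeque;

        struct Sequence { tokens: Vec<u32>, done: bool }

        struct Scheduler {
            waiting: VecDeque<Sequence>, // requests not yet admitted
            running: Vec<Sequence>,      // sequences decoded each step
            max_batch: usize,
        }

        const EOS_TOKEN: u32 = 2;

        // Stand-in for the real model forward pass.
        fn forward_one_token(_tokens: &[u32]) -> u32 { EOS_TOKEN }

        impl Scheduler {
            fn step(&mut self) {
                // Admit queued requests while there is room in the batch.
                while self.running.len() < self.max_batch {
                    match self.waiting.pop_front() {
                        Some(seq) => self.running.push(seq),
                        None => break,
                    }
                }
                // One decode step for every running sequence.
                for seq in &mut self.running {
                    let next = forward_one_token(&seq.tokens);
                    seq.tokens.push(next);
                    seq.done = next == EOS_TOKEN;
                }
                // Retire finished sequences so their slots free up next step.
                self.running.retain(|s| !s.done);
            }
        }

        fn main() {
            let mut sched = Scheduler {
                waiting: VecDeque::from(vec![Sequence { tokens: vec![1], done: false }]),
                running: Vec::new(),
                max_batch: 8,
            };
            sched.step();
            println!("running sequences: {}", sched.running.len());
        }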

    • k2so a day ago

      This is awesome, are you contributing this to candle or is it a standalone package?

      • zackangelo 14 hours ago

        Just trying to stay focused on launching first (https://docs.mixlayer.com) and keeping early customers happy, but would love to open source some of this work.

        It'd probably be a separate crate from candle. If you haven't checked it out yet, mistral.rs implements some of these things (https://github.com/EricLBuehler/mistral.rs). Eric hasn't done multi-GPU inference yet, but I know it's on his roadmap. Not sure if it helped, but I shared an early version of my llama 3.1 implementation with him.

        • J_Shelby_J 7 hours ago

          Hey, mixlayer is really cool.

          I also have a Rust LLM inference project, and the overlap with what mixlayer is doing is very high; it's actually crazy how closely our feature sets match. [1] Right now I'm still using llama.cpp on the backend, but eventually I want to move to Candle via mistral.rs.

          [1] https://github.com/ShelbyJenkins/llm_client