Comment by zackangelo a day ago

Their inference server is written in Rust using Hugging Face's Candle crate. One of the Moshi authors is also the primary author of Candle.

We’ve also been building our inference stack on top of Candle; I’m really happy with it.
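
If you haven’t looked at Candle yet, the tensor API is compact and feels familiar if you’ve used PyTorch. A toy sketch (not our server code, just a minimal candle_core example) looks roughly like this:

    use candle_core::{Device, Tensor};

    fn main() -> candle_core::Result<()> {
        // Toy "linear layer": random weights, random input, one matmul on the CPU.
        let device = Device::Cpu;
        let w = Tensor::randn(0f32, 1f32, (4, 8), &device)?;
        let x = Tensor::randn(0f32, 1f32, (1, 4), &device)?;
        let y = x.matmul(&w)?;
        println!("output shape: {:?}", y.shape());
        Ok(())
    }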

baggiponte a day ago

Super interested. Do you have an equivalent of vLLM? Did you have to rewrite batching, paged attention…?

  • zackangelo a day ago

    Yeah, I’ve had to rewrite continuous batching and other scheduling logic. That and multi-GPU inference have been the hardest things to build.

    I’ll need to get paged attention working as well, but I think I can launch without it.
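
    For anyone curious, the core idea of continuous batching is that requests join and leave the running batch between decode steps instead of waiting for the whole batch to finish. A rough sketch of the scheduler loop (hypothetical types, model call stubbed out, nothing Candle-specific):

        use std::collections::VecDeque;

        struct Sequence { tokens: Vec<u32>, done: bool }

        struct Scheduler {
            waiting: VecDeque<Sequence>, // requests not yet admitted
            running: Vec<Sequence>,      // sequences decoded each step
            max_batch: usize,
        }

        const EOS_TOKEN: u32 = 2;

        // Stand-in for the real model forward pass.
        fn forward_one_token(_tokens: &[u32]) -> u32 { EOS_TOKEN }

        impl Scheduler {
            fn step(&mut self) {
                // Admit queued requests while there is room in the batch.
                while self.running.len() < self.max_batch {
                    match self.waiting.pop_front() {
                        Some(seq) => self.running.push(seq),
                        None => break,
                    }
                }
                // One decode step for every running sequence.
                for seq in &mut self.running {
                    let next = forward_one_token(&seq.tokens);
                    seq.tokens.push(next);
                    seq.done = next == EOS_TOKEN;
                }
                // Retire finished sequences so their slots free up next step.
                self.running.retain(|s| !s.done);
            }
        }

        fn main() {
            let mut sched = Scheduler {
                waiting: VecDeque::from(vec![Sequence { tokens: vec![1], done: false }]),
                running: Vec::new(),
                max_batch: 8,
            };
            sched.step();
            println!("running sequences: {}", sched.running.len());
        }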

    • k2so a day ago

      This is awesome, are you contributing this to candle or is it a standalone package?

      • zackangelo 14 hours ago

        Just trying to stay focused on launching first (https://docs.mixlayer.com) and keeping early customers happy, but would love to open source some of this work.

        It'd probably be a separate crate from candle. If you haven't checked it out yet, mistral.rs implements some of these things (https://github.com/EricLBuehler/mistral.rs). Eric hasn't done multi-GPU inference yet, but I know it's on his roadmap. Not sure if it helped, but I shared an early version of my llama 3.1 implementation with him.

        • J_Shelby_J 7 hours ago

          Hey, mixlayer is really cool.

          I also have a Rust LLM inference project, and the overlap with what mixlayer is doing is very high; it's actually crazy how closely our feature sets match. [1] Right now I'm still using llama.cpp on the backend, but eventually I want to move to Candle via mistral.rs.

          [1] https://github.com/ShelbyJenkins/llm_client