Comment by baggiponte a year ago

Super interested. Do you have an equivalent of vLLM? Did you have to rewrite batching, paged attention…?

Yeah, I’ve had to rewrite continuous batching and other scheduling logic. That and multi-GPU inference have been the hardest things to build.
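For readers unfamiliar with the term, here is a minimal sketch of what continuous batching means at the scheduling level. It is not mixlayer's code: `Request`, `Scheduler`, and `max_batch_size` are hypothetical names, and the "forward pass" is replaced by a counter decrement so the example only needs the standard library. The point it illustrates is that finished sequences leave the batch after every decode step and waiting requests take their slots right away, instead of the whole batch draining first.

```rust
use std::collections::VecDeque;

/// Hypothetical request: an id plus the number of tokens it still has to generate.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub id: u32,
    pub remaining_tokens: usize,
}

/// Hypothetical scheduler: a FIFO waiting queue and the currently running batch.
pub struct Scheduler {
    max_batch_size: usize,
    waiting: VecDeque<Request>,
    running: Vec<Request>,
}

impl Scheduler {
    pub fn new(max_batch_size: usize) -> Self {
        Scheduler {
            max_batch_size,
            waiting: VecDeque::new(),
            running: Vec::new(),
        }
    }

    /// Queue a request; it joins the batch as soon as a slot is free.
    pub fn submit(&mut self, req: Request) {
        self.waiting.push_back(req);
    }

    /// Run one decode iteration and return the ids that finished in it.
    pub fn step(&mut self) -> Vec<u32> {
        // 1. Fill free batch slots from the waiting queue (the "continuous" part).
        while self.running.len() < self.max_batch_size {
            match self.waiting.pop_front() {
                Some(req) => self.running.push(req),
                None => break,
            }
        }

        // 2. Stand-in for the model forward pass: every running sequence
        //    produces exactly one token.
        for req in self.running.iter_mut() {
            req.remaining_tokens = req.remaining_tokens.saturating_sub(1);
        }

        // 3. Retire finished sequences immediately so their slots are
        //    available at the start of the next step.
        let mut finished = Vec::new();
        self.running.retain(|r| {
            if r.remaining_tokens == 0 {
                finished.push(r.id);
                false
            } else {
                true
            }
        });
        finished
    }

    /// Ids of the sequences currently in the batch, in slot order.
    pub fn running_ids(&self) -> Vec<u32> {
        self.running.iter().map(|r| r.id).collect()
    }

    /// True when nothing is waiting or running.
    pub fn is_idle(&self) -> bool {
        self.waiting.is_empty() && self.running.is_empty()
    }
}
```

A real scheduler also has to handle prefill versus decode steps, memory limits, and preemption, none of which this sketch attempts.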

I’ll need to get paged attention working as well, but I think I can launch without it.
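As background on paged attention, the core bookkeeping can be sketched as a block table: the KV cache is split into fixed-size blocks, and each sequence keeps a list of the physical blocks it owns, so its cache does not need to be contiguous. The sketch below covers only that allocator side, not the attention kernel itself. `BlockAllocator`, `append_token`, and the other names are hypothetical and not taken from vLLM, candle, or mixlayer.

```rust
use std::collections::HashMap;

/// Hypothetical KV-cache block allocator with one block table per sequence.
pub struct BlockAllocator {
    block_size: usize,                 // tokens per block
    free_blocks: Vec<usize>,           // physical block ids not in use
    tables: HashMap<u32, Vec<usize>>,  // sequence id -> physical blocks, in logical order
    lengths: HashMap<u32, usize>,      // sequence id -> tokens stored so far
}

impl BlockAllocator {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be positive");
        BlockAllocator {
            block_size,
            // Reversed so that `pop` hands out block 0 first.
            free_blocks: (0..num_blocks).rev().collect(),
            tables: HashMap::new(),
            lengths: HashMap::new(),
        }
    }

    /// Reserve a cache slot for the next token of `seq`.
    /// Returns (physical block, offset within block), or None if no block is free.
    pub fn append_token(&mut self, seq: u32) -> Option<(usize, usize)> {
        let len = *self.lengths.get(&seq).unwrap_or(&0);
        let offset = len % self.block_size;
        if offset == 0 {
            // The last block is full (or the sequence is new): take a fresh block.
            let block = self.free_blocks.pop()?;
            self.tables.entry(seq).or_default().push(block);
        }
        let block = *self.tables.get(&seq)?.last()?;
        self.lengths.insert(seq, len + 1);
        Some((block, offset))
    }

    /// Return all blocks owned by `seq` to the free list.
    pub fn free(&mut self, seq: u32) {
        if let Some(blocks) = self.tables.remove(&seq) {
            self.free_blocks.extend(blocks);
        }
        self.lengths.remove(&seq);
    }

    pub fn free_block_count(&self) -> usize {
        self.free_blocks.len()
    }
}
```

Because blocks are handed out one at a time as a sequence grows, memory is reserved per block rather than for each sequence's maximum length up front, which is the main appeal of the scheme.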

RRRozie a year ago

Are you aiming for Nvidia hardware with rust-cuda, or looking to integrate with non-Nvidia hardware?

- zackangelo a year ago
  
  We used candle[0], which uses cudarc and the metal crate under the hood. That means we run on NVIDIA hardware in production and can test locally on MacBooks with smaller models.
  I would certainly like to use non-NVIDIA hardware, but at this point it's not a priority. The subset of tensor operations needed to run the forward pass of LLMs isn't as large as you'd think, though.
  [0] https://github.com/huggingface/candle
  
k2so a year ago

This is awesome. Are you contributing this to candle, or is it a standalone package?

- zackangelo a year ago
  
  Just trying to stay focused on launching first (https://docs.mixlayer.com) and keeping early customers happy, but would love to open source some of this work.
  It'd probably be a separate crate from candle. If you haven't checked it out yet, mistral.rs implements some of these things (https://github.com/EricLBuehler/mistral.rs). Eric hasn't done multi-GPU inference yet, but I know it's on his roadmap. Not sure if it helped, but I shared an early version of my Llama 3.1 implementation with him.
  
  J_Shelby_J a year ago
  
  Hey, mixlayer is really cool.
  I also have a Rust LLM inference project. The overlap between what mixlayer is doing and what my project is doing is very high. It's actually crazy how we basically have the same features. [1] Right now I'm still using llama.cpp on the backend, but I eventually want to move to candle via mistral.rs.
  [1] https://github.com/ShelbyJenkins/llm_client
  