Comment by zackangelo

zackangelo a day ago

Yeah, I’ve had to rewrite continuous batching and other scheduling logic. That and multi-GPU inference have been the hardest things to build.

I’ll need to get paged attention working as well, but I think I can launch without it.

k2so a day ago

This is awesome, are you contributing this to candle or is it a standalone package?

  • zackangelo 14 hours ago

    Just trying to stay focused on launching first (https://docs.mixlayer.com) and keeping early customers happy, but would love to open source some of this work.

    It'd probably be a separate crate from candle. If you haven't checked it out yet, mistral.rs implements some of these things (https://github.com/EricLBuehler/mistral.rs). Eric hasn't done multi-GPU inference yet, but I know it's on his roadmap. Not sure if it helped, but I shared an early version of my Llama 3.1 implementation with him.

    • J_Shelby_J 7 hours ago

      Hey, mixlayer is really cool.

      I also have a Rust LLM inference project. The overlap between what mixlayer is doing and what my project is doing is very high. It's actually crazy how we basically have the same features [1]. Right now I'm still using llama.cpp on the backend, but eventually I want to move to candle via mistral.rs.

      [1] https://github.com/ShelbyJenkins/llm_client