Comment by diggan
> Luminal can run Q8 Llama 3 8B on M-series Macbooks at 15-25 tokens per second. The goal is to become the fastest ML framework for any model on any device.
Great that some numbers are provided, but in isolation I'm not sure what they tell us. It would be helpful to also share what tok/s you'd get with llama.cpp or something else on the same hardware, so we can actually judge whether it's faster or not :) Including the prompt-processing speed would be a bonus!
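For reference, here is a minimal sketch of how one could produce the llama.cpp side of that comparison, assuming the llama-cpp-python bindings (`pip install llama-cpp-python`) and a local Q8 GGUF file; the model path and prompt below are placeholders. With `verbose=True`, llama.cpp prints its own timing breakdown to stderr after the call, which splits prompt-processing speed from generation speed. The `llama-bench` CLI that ships with llama.cpp reports the same split directly and is probably the easier route.

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# verbose=True makes llama.cpp print its internal timings (prompt eval
# and generation tok/s) to stderr after the completion finishes.
# n_gpu_layers=-1 offloads all layers (Metal on M-series Macs).
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf",  # placeholder path
    n_gpu_layers=-1,
    verbose=True,
)

prompt = "Explain the attention mechanism in one paragraph."  # placeholder

t0 = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - t0

# Coarse end-to-end rate; the verbose stderr report gives the
# prompt-processing vs. generation breakdown separately.
n_gen = out["usage"]["completion_tokens"]
print(f"end-to-end: {n_gen / elapsed:.1f} tok/s over {n_gen} generated tokens")
```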
Yeah, those numbers look very low to me for something that's supposed to represent a state-of-the-art optimization technique. I think that's lower than what other implementations get, although it depends on the MacBook.
Nonetheless, this project looks very cool, and I hope they can keep improving it to the point where it really does beat human-led optimizations.