diggan 3 days ago

> Luminal can run Q8 Llama 3 8B on M-series Macbooks at 15-25 tokens per second. The goal is to become the fastest ML framework for any model on any device.

Great that some numbers are provided, but in isolation I'm not sure what they tell us. It would be helpful to also share the tok/s you'd get with llama.cpp or something else on the same hardware, so we can actually tell whether it's faster or not :) Including prompt-processing speed would be a bonus!
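For anyone wanting to run that comparison themselves, decode throughput is just tokens generated divided by wall-clock time. A minimal sketch (the `fake_generate` stub is hypothetical; in practice you'd call Luminal's or llama.cpp's decode loop with the same model, quantization, and generation length on the same machine):

```python
import time

def tokens_per_second(generate, n_tokens):
    """Time a generation callable and return decode throughput in tok/s.

    `generate` is any function that produces `n_tokens` tokens; it is
    stubbed out here, since the real call depends on the framework
    being benchmarked.
    """
    start = time.perf_counter()
    generate(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Hypothetical stand-in for a real decode loop (e.g. Luminal or llama.cpp bindings).
def fake_generate(n):
    time.sleep(n * 0.01)  # pretend each token takes 10 ms, i.e. ~100 tok/s

print(f"{tokens_per_second(fake_generate, 50):.0f} tok/s")  # ~100 on an idle machine
```

For the llama.cpp side specifically, its bundled `llama-bench` tool reports both prompt-processing and generation tok/s, which covers the prompt-processing number asked for above.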

Reubend 3 days ago

Yeah, those numbers look very low to me for something that's supposed to represent a state-of-the-art optimization technique. I think they're lower than other implementations, although it depends on the MacBook.

Nonetheless this project looks very cool, and I hope they can continue improving it to the point where it indeed beats human-led optimizations.

jafioti 3 days ago

a lot of the search is still being optimized, so we don't match the super hand-optimized kernels llama.cpp has, and we def don't match their tps yet. but i want to make a perf tracking page to see improvements over time and prevent regressions