Comment by pj_mukh

Comment by pj_mukh 6 days ago

Amazing, this is so so useful.

Thank you especially for the phone model vs. tok/s breakdown. Do you have such tables for more models, including ones even leaner than Gemma3 1B? How low can you go? Say, if I wanted to eke out 45 tok/s on an iPhone 13?

P.S.: Also, I'm assuming the speeds stay consistent across React Native, Flutter, etc.?

rshemet 6 days ago

Thank you! We're continuing to add performance metrics as more data comes in.

A Qwen 2.5 500M will get you to ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size.

Yes, speeds are consistent across frameworks, although (and don't quote me on this), I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
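
To make the "inversely proportional" rule of thumb concrete, here is a minimal sketch that scales the one data point quoted above (≈45 tok/sec for a 500M-parameter model on an iPhone 13) to other model sizes. The function name and the strict 1/size scaling are assumptions for illustration only. This is not Cactus code, and the printed numbers are extrapolations rather than measurements; real speeds also depend on quantization, memory bandwidth, and thermals.

```python
def estimate_tok_per_sec(ref_params_m: float, ref_tok_per_sec: float,
                         target_params_m: float) -> float:
    """Scale a measured speed, ASSUMING tok/sec is inversely proportional
    to parameter count (a rule of thumb, not a measured law)."""
    if ref_params_m <= 0 or target_params_m <= 0:
        raise ValueError("parameter counts must be positive")
    return ref_tok_per_sec * ref_params_m / target_params_m


# Reference point from the thread: ~45 tok/sec at 500M parameters on an iPhone 13.
REF_PARAMS_M = 500.0
REF_TOK_PER_SEC = 45.0

if __name__ == "__main__":
    # Hypothetical target sizes, in millions of parameters.
    for params_m in (250.0, 500.0, 1000.0):
        speed = estimate_tok_per_sec(REF_PARAMS_M, REF_TOK_PER_SEC, params_m)
        print(f"{params_m:>6.0f}M params: ~{speed:.1f} tok/sec (extrapolated)")
```

Under this assumption, doubling the parameter count halves the estimated speed, so treat the result as a first guess to check against an actual benchmark on the target phone.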


pickettd 6 days ago

I also want to add on that I really appreciate the benchmarks.
When I was working with RAG llama.cpp through RN early last year, I had pretty acceptable tok/sec results up through 7-8B quantized models (on phones like the S24+ and iPhone 15 Pro). MLC was definitely higher tok/sec, but it is really tough to beat the community support and availability in the gguf ecosystem.

Reebz 6 days ago

Looking at the current benchmarks table, I was curious: what do you think is wrong with the Samsung S25 Ultra?
Most of the standard mobile CPU benchmarks (GeekBench, AnTuTu, et al.) show a 20-40% performance gain over the S23/S24 Ultra. Also, this bucks the trend where most other devices are ranked appropriately (i.e. newer devices perform better).
Thanks for sharing your project.

- rshemet 6 days ago
  
  Great observation. This data is not from a controlled environment; these are metrics from Cactus Chat usage (we only collect tok/sec telemetry).
  S25 is an outlier that surprised us too.
  I got $10 on S25 climbing back up to the top of the rankings as more data comes in :)
  