Comment by petesergeant a day ago

> Built on top of Together Turbo Speculator, ATLAS reaches up to 500 TPS on DeepSeek-V3.1 and up to 460 TPS on Kimi-K2 in a fully adapted scenario — 2.65x faster than standard decoding, outperforming even specialized hardware like Groq

and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905

You'll see Groq averaging 1,086tps vs Together doing 59tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
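
To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It uses only the throughput figures quoted above; the 1,000-token response length is an assumption picked for illustration, and real latency also depends on time to first token, which is ignored here.

```python
# Throughput figures quoted in the comment (tokens per second).
groq_tps = 1086          # OpenRouter average for Groq on kimi-k2-0905
together_tps = 59        # OpenRouter average for Together on the same model
atlas_claimed_tps = 460  # "up to" figure from the quoted ATLAS announcement


def seconds_for(tokens: int, tps: float) -> float:
    """Time to stream `tokens` output tokens at a steady `tps` rate."""
    return tokens / tps


# Hypothetical response length, chosen only for illustration.
response_tokens = 1000

speed_ratio = groq_tps / together_tps
print(f"Groq vs Together (observed): {speed_ratio:.1f}x")

for name, tps in [
    ("Groq (observed)", groq_tps),
    ("Together (observed)", together_tps),
    ("ATLAS (claimed peak)", atlas_claimed_tps),
]:
    print(f"{name}: {seconds_for(response_tokens, tps):.1f}s per {response_tokens} tokens")
```

The claimed peak and the observed average are not measured under the same conditions, so this only shows the size of the gap, not which number is right.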

Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty well on it too, but still, the difference between a top provider and an also-ran is giant.

God I love OpenRouter.

KronisLV a day ago

> I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.

I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with on other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code

At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude or GPT-5 or Gemini 2.5 Pro for more complex issues (or ones where I need good usage of the Latvian language), at least for programming tasks. It'd eventually be super cool if they could offer more models.

Or have some sort of a partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limits occasionally.

meander_water a day ago

Interesting, if you take a look at the median throughput chart [0], groq goes insane after 7th Oct. Wonder what happened.

[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance

sigmar a day ago

2x jump overnight. New LPU hardware? I checked the speed for Groq's gpt-oss-120B, Llama4-maverick, and Llama4-scout; none of them had a noticeable change this month.

awestroke a day ago

Heavy quantization

- petesergeant a day ago
  
  They claim (or someone on Reddit who claims to be staff claims) that's not accurate: https://www.reddit.com/r/LocalLLaMA/comments/1mk4kt0/comment...
  
immortal3 a day ago

There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure about Together. In this case, Together is sharing results based on the B200 GPU. Another important point is the accuracy of these speed-ups compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers. https://x.com/Kimi_Moonshot/status/1976926483319763130

rfoo a day ago

> It's known that such tricks reduce accuracy

AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.
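
A toy sketch of why that holds in the greedy case. Everything here is hypothetical: `target_model` and `draft_model` are made-up deterministic functions standing in for real LLMs, and this is plain greedy speculative decoding, not the ATLAS method itself. The draft proposes `k` tokens, the target checks each one, and on the first mismatch the target's own token is used instead. Every emitted token is therefore one the target would have produced anyway.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for an LLM: maps a token prefix to its greedy next token.
Model = Callable[[List[Token]], Token]


def target_model(prefix: List[Token]) -> Token:
    # Toy deterministic rule playing the role of the large model.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[Token]) -> Token:
    # Toy small model: agrees with the target except when the
    # prefix length is a multiple of 4, where it is deliberately wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[Token],
                       n_new: int, k: int = 4) -> List[Token]:
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real system does
        #    this in one batched forward pass, which is where the speed-up
        #    comes from; here it is a plain loop.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the accepted prefix, take the target's token,
                # then go back to drafting from this point.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
speculated = speculative_decode(target_model, draft_model, prompt, 20)
print(baseline == speculated)
```

With sampling instead of greedy decoding, the usual scheme swaps the equality check for a rejection-sampling step designed to preserve the target model's output distribution. Quantization is a separate knob and is not modeled here.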

- martinald a day ago
  
  No it shouldn't do. "All" you're doing is having a small model draft the next few tokens and then having the large model "verify" them. When the large model diverges from the small one, you keep the large model's token and restart the process from that point.
  
- Der_Einzige a day ago
  
  It’s quantization which is crippling accuracy…
  
  petesergeant 12 hours ago
  
  People all over this subthread saying that with no evidence provided. The company say they don’t — which would be pretty embarrassing to have to walk back — so who’s saying they do?
  
jsheard a day ago

> Groq and Cerebras use custom chips

Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.

- petesergeant a day ago
  
  This is an "expensive for whom" question. I'd be keen to know if they're burning investor money hosting these right now or if they're able to run these at cost.
  
senko a day ago

> You'll see Groq averaging 1,086tps

What I don't understand is why Groq itself reports 200tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...

OpenRouter numbers look fishy.

petesergeant 13 hours ago

Wonder if it’s prompt caching? OpenRouter is (I guess) just reporting actual throughput, where presumably groq is reporting a from-scratch figure? Just a guess tho.

jbellis a day ago

groq is quantizing, even though it's not labeled as such on openrouter (super frustrating)

bn-l a day ago

Do you have a source for that? They are pretty close to the ref implementation on moonshot’s ranking

- jbellis 7 hours ago
  
  https://groq.com/blog/inside-the-lpu-deconstructing-groq-spe...
  
alecco a day ago

But Groq/Cerebras are hardware accelerators. It's an unrelated optimization. I wouldn't be surprised if they could also use speculators (today or in the future).

Havoc a day ago

>Groq and Cerebras often feel like the only games in town.

SambaNova should be similar...they've got a similar specialized hardware approach

p1esk a day ago

Do these numbers compare performance at the same cost?

petesergeant a day ago

You can see the cost in the links, and the answer is “pretty much” for the consumer. The backend maths, no idea.
