Comment by immortal3 a day ago

There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure about Together. In this case, Together is sharing results based on the B200 GPU. Another important point is the accuracy of these speed-ups compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers. https://x.com/Kimi_Moonshot/status/1976926483319763130

rfoo a day ago

> It's known that such tricks reduce accuracy

AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.

martinald a day ago

No, it shouldn't. "All" you're doing is having a small model draft the next few tokens and then having the large model "verify" them. When the large model diverges from the small one, you keep the large model's token at that position and restart the process from there.
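
To make the draft-and-verify loop concrete, here is a minimal sketch of the greedy case only. It is an illustration, not any provider's implementation: `toy_target`, `toy_draft`, and the block size `k` are hypothetical stand-ins, and a real system would verify all drafted positions in one batched forward pass of the large model rather than one call per position. The point it shows is that every emitted token is one the large model would have produced anyway, so the output matches plain greedy decoding.

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to the next token
# (greedy decoding). Both toy models below are hypothetical stand-ins.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: the large model generates one token at a time."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Draft up to k tokens with the small model, keep the verified prefix."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The small model proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The large model checks each proposed position.
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                # Divergence: keep the agreed prefix plus the large model's
                # own token, then start drafting again from there.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token matched the large model.
            tokens.extend(proposal)
    return tokens


def toy_target(tokens: List[int]) -> int:
    # Hypothetical "large" model: an arbitrary deterministic rule.
    return (sum(tokens) * 31 + 7) % 50


def toy_draft(tokens: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except on
    # prefixes whose length is a multiple of 3.
    if len(tokens) % 3 == 0:
        return (toy_target(tokens) + 1) % 50
    return toy_target(tokens)


baseline = greedy_decode(toy_target, [1, 2, 3], 20)
speculative = speculative_decode(toy_target, toy_draft, [1, 2, 3], 20)
print(baseline == speculative)
```

With sampling instead of greedy decoding, the verification step is usually a rejection-sampling rule designed to preserve the large model's output distribution rather than an exact token comparison; that variant is not shown here.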

Der_Einzige a day ago

It’s quantization which is crippling accuracy…

- petesergeant 12 hours ago
  
  People all over this subthread are saying that with no evidence provided. The company says they don’t quantize, which would be pretty embarrassing to have to walk back, so who’s saying they do?

jsheard a day ago

> Groq and Cerebras use custom chips

Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.

petesergeant a day ago

This is an "expensive for whom" question. I'd be keen to know if they're burning investor money hosting these right now or if they're able to run these at cost.
