Comment by immortal3
There's another angle to this comparison. Groq and Cerebras use custom chips, but I'm not sure whether Together does. In this case, Together is sharing results measured on the B200 GPU. Another important point is how accurate the sped-up model is compared to the baseline model. It's known that such tricks reduce accuracy, but by how much? Kimi has already benchmarked several providers on this: https://x.com/Kimi_Moonshot/status/1976926483319763130
> It's known that such tricks reduce accuracy
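AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy: the draft model only proposes tokens, and the target model still verifies every one of them, so the output distribution should match the target model's.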
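
To make that concrete, here is a minimal sketch of the accept/reject rule usually described for speculative sampling, on a toy three-token vocabulary. The distributions `p` (target) and `q` (draft) and both function names are made up for illustration; this is not Together's implementation, just the textbook rule as I understand it.

```python
import random
from collections import Counter


def speculative_sample(p, q, rng):
    """Draw one token that follows the target distribution p, using draft q.

    p and q are lists of probabilities over the same toy vocabulary
    (hypothetical values, not taken from any real model).
    """
    vocab = list(range(len(p)))
    # 1. The cheap draft model proposes a token.
    x = rng.choices(vocab, weights=q)[0]
    # 2. The target model accepts it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # 3. On rejection, resample from the residual max(0, p - q).
    #    rng.choices normalises the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0]


def empirical_distribution(p, q, n=200_000, seed=0):
    """Estimate the output distribution of speculative_sample by simulation."""
    rng = random.Random(seed)
    counts = Counter(speculative_sample(p, q, rng) for _ in range(n))
    return [counts[i] / n for i in range(len(p))]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # toy target distribution (assumption)
    q = [0.2, 0.5, 0.3]  # toy draft distribution, deliberately a poor match
    print("target:   ", p)
    print("empirical:", empirical_distribution(p, q))
```

Even with a badly mismatched draft, the empirical frequencies should land close to `p`. A worse draft only lowers the acceptance rate, which costs speed rather than accuracy. If a provider's output quality does drop, the cause would have to be something other than this rule.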