Comment by wishawa a day ago
Inference is impressively fast. But what about quality? In the Kimi vendor verifier (https://github.com/MoonshotAI/K2-Vendor-Verifier/), Together has one of the highest tool call failure rates (>300 failures over the benchmark, compared to 0-2 for the official API, groq, SiliconFlow, and Infinigence).
I don't know anything about Together's quality in general, but the specific technique discussed here (speculative decoding) has no impact on generation quality: every token the draft model proposes is verified against the base model, so the output distribution is provably identical to sampling from the base model alone. You should be able to apply it to whichever model you want and see the advertised speedup while retaining the quality of your base model.
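To see why the speedup is "free", here is a toy sketch of the standard speculative-sampling accept/reject rule (the distributions and vocabulary are made up for illustration; real systems use a small draft LLM and the large base LLM). A drafted token is accepted with probability min(1, p_target/p_draft); on rejection, a correction token is resampled from the normalized residual max(0, p_target − p_draft). Either way, the emitted token is distributed exactly according to the target model:

```python
import random

VOCAB = ["a", "b", "c"]

def draft_probs(ctx):
    # Toy draft model: cheap, slightly different distribution (hypothetical numbers).
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def target_probs(ctx):
    # Toy target (base) model: the distribution we want the output to follow.
    return {"a": 0.5, "b": 0.4, "c": 0.1}

def speculative_step(ctx, k=4):
    """Draft up to k tokens, verifying each against the target model.

    Accepted tokens (and the one correction token after a rejection) are
    distributed exactly per target_probs, so quality matches the base model;
    the speedup comes from verifying k drafted tokens in one target pass.
    """
    out = []
    for _ in range(k):
        q = draft_probs(ctx + out)
        tok = random.choices(VOCAB, weights=[q[t] for t in VOCAB])[0]
        p = target_probs(ctx + out)
        if random.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted: token is target-distributed
        else:
            # rejected: resample from the normalized residual max(0, p - q)
            resid = {t: max(0.0, p[t] - q[t]) for t in VOCAB}
            z = sum(resid.values())
            out.append(random.choices(VOCAB, weights=[resid[t] / z for t in VOCAB])[0])
            return out  # stop after emitting the correction token
    return out

random.seed(0)
print(speculative_step([]))  # a short run of tokens drawn from the target distribution
```

Sampling many first tokens from this loop reproduces the target model's marginals (≈0.5/0.4/0.1 here), which is the formal sense in which speculative decoding cannot degrade quality.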