Comment by rfoo
If you compare "schema validation error count" plus the count of "Finish Reason: others", then SiliconFlow and Infinigence are in the same bucket too. Maybe their API layer detected the incorrect tool calls and set the finish reason to something else?
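(Rough sketch of how those two buckets could be tallied together; the response shape, field names, and validation logic below are my assumptions, not the actual vendor-verifier code.)

```python
# Hypothetical sketch: count schema-invalid tool calls plus non-"tool_calls"
# finish reasons so both failure modes land in one combined bucket.
import json
from jsonschema import ValidationError, validate

def combined_failure_count(responses, tool_schemas):
    schema_errors = 0
    other_finish_reasons = 0
    for resp in responses:
        if resp.get("finish_reason") != "tool_calls":
            # An API layer that catches a malformed call and reports
            # "stop"/"error" instead would land in this bucket.
            other_finish_reasons += 1
            continue
        for call in resp.get("tool_calls", []):
            schema = tool_schemas.get(call["name"], {})  # {} accepts anything
            try:
                validate(json.loads(call["arguments"]), schema)
            except (ValidationError, json.JSONDecodeError):
                schema_errors += 1
    return schema_errors + other_finish_reasons
```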
IMO that combined error rate is likely what you get from running the model correctly as-is (i.e. using the same weight and activation dtypes), so Together is not bad.
Moonshot AI themselves and Groq likely use some sampler tricks to eliminate schema validation errors (see the toy sketch at the end of this comment).
So really the only thing this shows is: Nebius, Chutes, and AtlasCloud could be running something else (for example, a further-quantized model), or have bugs.
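To make the "sampler tricks" bit concrete, here is a toy sketch of constrained decoding: logits for tokens that would violate the schema are masked to -inf, so a schema-invalid tool call can never be sampled. This is my illustration of the general technique, not anything Moonshot or Groq has disclosed; the vocabulary and allowed values are made up.

```python
# Toy constrained decoding: only tokens that keep the output a valid prefix
# of an allowed tool name can be sampled, regardless of the raw logits.
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz_")        # toy character-level vocab
ALLOWED = {"get_weather", "get_time"}               # schema enum for the tool name

def allowed_next_chars(prefix: str) -> set[str]:
    # Characters that keep `prefix` a prefix of some allowed value.
    return {v[len(prefix)] for v in ALLOWED if v.startswith(prefix) and len(v) > len(prefix)}

def constrained_sample(logits: np.ndarray, prefix: str) -> str:
    mask = np.full_like(logits, -np.inf)
    ok = [i for i, ch in enumerate(VOCAB) if ch in allowed_next_chars(prefix)]
    mask[ok] = 0.0
    scores = logits + mask
    probs = np.exp(scores - np.max(scores))
    probs /= probs.sum()
    return VOCAB[np.random.choice(len(VOCAB), p=probs)]

# Decode a tool name one character at a time; the mask guarantees the result
# is one of the allowed values no matter what the model "prefers".
out = ""
while out not in ALLOWED:
    out += constrained_sample(np.random.randn(len(VOCAB)), out)
print(out)  # always "get_weather" or "get_time"
```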
Fair point. If Moonshot is holding back the true weights or inference techniques that affect correctness, then providers, including Together, should call them out on that. I for one would stop using Kimi if that is the case.
Anyway, Novita is doing significantly better than Together on the vendor verifier chart, so the low quality must be at least partially Together's fault.