Comment by wishawa
Fair point. If Moonshot is holding back the true weights or inference techniques that affect correctness, then providers including Together should call them out on that. I for one would stop using Kimi if that is the case.
Anyway, Novita is doing significantly better than Together on the vendor verifier chart, so at least part of the low quality must be Together's fault.
I don't think it's the weights being different or special inference techniques. More likely they haven't been able to train the model to follow tool schemas perfectly yet, and both Moonshot and Groq decided to use something like https://github.com/noamgat/lm-format-enforcer to make sure at least the output format is correct.
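For anyone curious what that would look like, here's a rough sketch using lm-format-enforcer's HuggingFace Transformers integration. To be clear, this is just an illustration of the technique, not what Moonshot or Groq actually run (their serving stacks aren't public), and the model id and tool schema below are made up:

```python
# Illustrative only: the model id and tool-call schema are placeholders,
# not anything Moonshot or Groq have published.
from transformers import AutoModelForCausalLM, AutoTokenizer
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)

model_id = "some/causal-lm"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical JSON schema for a tool call: a tool name plus its arguments.
tool_call_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}

# At every decoding step the parser reports which tokens are still legal,
# so the model can only ever emit JSON that matches the schema.
parser = JsonSchemaParser(tool_call_schema)
prefix_fn = build_transformers_prefix_allowed_tokens_fn(tokenizer, parser)

prompt = "Call the weather tool for Paris."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    prefix_allowed_tokens_fn=prefix_fn,  # standard HF generate hook
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Constrained decoding like this guarantees the output *parses* against the schema, but it can't guarantee the model picked the right tool or filled in sensible arguments, which is why format enforcement alone wouldn't close a real quality gap.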