lalassu 2 days ago

Disclaimer: I did not test this yet.

I don't want to make big generalizations, but one thing I've noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a bit over-fitted to the benchmarks and less to actual use cases.

I hope it's not the same here.

msp26 2 days ago

K2 Thinking has immaculate vibes. Minimal sycophancy and a pleasant writing style while being occasionally funny.

If it had vision and was better on long context I'd use it so much more.

CuriouslyC a day ago

This was a bad problem with earlier Chinese models (Qwen and Kimi K1 in particular), but the original DeepSeek delivered and GLM4.6 delivers. They don't diversify training as much as the American labs, so you'll find more edge cases and the interaction experience isn't quite as smooth, but the models put in work.

vorticalbox 2 days ago

This used to happen with benchmarks on phones: manufacturers would tweak Android so benchmarks ran faster.

I guess that's kind of how it is for any system trained to do well on benchmarks: it does well on them but is rubbish at everything else.

  • make3 2 days ago

    Yes, they turned off all power-saving measures when benchmarking software was detected, which defeated the point of the benchmarks: your phone is useless if it's very fast but the battery lasts an hour.

segmondy a day ago

Weird, I have gone local for the last two years. I use Chinese models 90% of the time: Kimi K2 Thinking, DeepSeek V3 Terminus, Qwen3, and GLM4.6. I'm not vibe testing them but really putting them to use, and they keep up great.

nylonstrung a day ago

My experience with DeepSeek and Kimi is quite the opposite: they're smarter than the benchmarks would imply.

Whereas the benchmark gains from new OpenAI, Grok, and Claude models don't feel accompanied by a corresponding improvement in vibes.

not_that_d 2 days ago

What is "Vibe testing"?

  • catigula 2 days ago

    He means capturing things that benchmarks don't. You can use Claude and GPT-5 back-to-back in a field they score nearly identically on, and you will notice several differences. That is the "vibe".

  • BizarroLand 2 days ago

    I would assume that it is testing how well and appropriately the LLM responds to prompts.

make3 2 days ago

I would assume that a huge amount is spent on frontier models just making them nicer to interact with, as that is likely one of the main things that drives user engagement.

catigula 2 days ago

This is why I stopped bothering to check out these models and, funnily enough, Grok.