lalassu 2 days ago

Disclaimer: I did not test this yet.

I don't want to make big generalizations, but one thing I've noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. They feel a bit over-fitted to the benchmarks and less to actual use cases.

I hope it's not the same here.

msp26 2 days ago

K2 Thinking has immaculate vibes. Minimal sycophancy and a pleasant writing style while being occasionally funny.

If it had vision and was better on long context I'd use it so much more.

CuriouslyC a day ago

This was a bad problem with earlier Chinese models (Qwen and Kimi K1 in particular), but the original DeepSeek delivered and GLM4.6 delivers. They don't diversify training as much as the American labs, so you'll find more edge cases and the interaction experience isn't quite as smooth, but the models put in work.

vorticalbox 2 days ago

This used to happen with benchmarks on phones: manufacturers would tweak Android so benchmarks ran faster.

I guess that's kind of how it is for any system trained to do well on benchmarks: it does well on them but is rubbish at everything else.

  • make3 2 days ago

    Yes, they turned off all power-saving measures when benchmarking software was detected, which defeated the point of the benchmarks: your phone is useless if it's very fast but the battery lasts an hour.

segmondy a day ago

Weird, I have gone local for the last two years. I use Chinese models 90% of the time: Kimi K2 Thinking, DeepSeek V3 Terminus, Qwen3, and GLM4.6. I'm not vibe testing them but really putting them to use, and they keep up great.

nylonstrung a day ago

My experience with DeepSeek and Kimi is quite the opposite: they're smarter than the benchmarks would imply.

Whereas the benchmark gains from new OpenAI, Grok, and Claude models don't feel accompanied by a corresponding improvement in vibes.

not_that_d 2 days ago

What is "Vibe testing"?

  • catigula 2 days ago

    He means capturing things that benchmarks don't. You can use Claude and GPT-5 back-to-back in a field they score nearly identically on, and you will notice several differences. That is the "vibe".

  • BizarroLand 2 days ago

    I would assume that it is testing how well and appropriately the LLM responds to prompts.

make3 2 days ago

I would assume that a huge amount is spent on frontier models just making them nicer to interact with, as that is likely one of the main things that drives user engagement.

catigula 2 days ago

This is why I stopped bothering to check out these models and, funnily enough, Grok.