Comment by amluto
The best part is when a “thinking” model carefully thinks and then says something that is obviously illogical, even though the model clearly has both the knowledge and the context to know it’s wrong. And then you ask it to double-check, give it a tiny hint about how it’s wrong, and it profusely apologizes, compliments you on your wisdom, and then says something else dumb.
I fully believe that LLMs encode enormous amounts of knowledge (some of which is even correct, and much of which their operator does not personally possess), are capable of ingesting large amounts of data and working quickly, and have essentially no judgment or particularly strong intelligence of the non-memorized sort. This can still be very valuable!
Maybe this will change over the next few years, and maybe it won’t. I’m not at all convinced that scraping the bottom of the barrel for more billions and trillions of low-quality training tokens will help much.
I feel like one coding benchmark should just be repeatedly telling the model to double-check or fix something that’s actually perfectly fine, and watching how badly it deep-fries your code base.
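A minimal sketch of what such a benchmark loop could look like, assuming the project’s own test suite as the ground-truth oracle. `ask_model_to_fix` is a hypothetical stand-in for whatever model client you’re testing; none of these names refer to an existing harness:

```python
import subprocess
from pathlib import Path


def tests_pass(repo: Path) -> bool:
    """Correctness oracle: does the project's own test suite still pass?"""
    return subprocess.run(["pytest", "-q"], cwd=repo).returncode == 0


def ask_model_to_fix(source: str) -> str:
    """Hypothetical LLM call. The prompt insists something is wrong even
    though nothing is; swap in any real model client here."""
    raise NotImplementedError  # illustrative stub


def deep_fry_score(repo: Path, target: Path, max_rounds: int = 10) -> int:
    """Start from a codebase whose tests pass. Each round, feed the model
    one file with 'double-check and fix this', accept its rewrite verbatim,
    and re-run the tests. The score is how many rounds of unnecessary
    'fixes' the codebase survives before they break it."""
    assert tests_pass(repo), "baseline must be green"
    for round_no in range(1, max_rounds + 1):
        target.write_text(ask_model_to_fix(target.read_text()))
        if not tests_pass(repo):
            return round_no - 1  # survived this many rounds
    return max_rounds
```

The key design choice is that the input is already correct, so any accepted change is pure downside: a higher score means the model is better at recognizing there’s nothing to fix instead of inventing a problem to apologize for.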