Comment by belter
Every two months, I run a very simple experiment to decide whether I should stop shorting NVDA... Think of it as my personal Pelican on a Bike test. :-)
Here is how it works: I take the latest state-of-the-art model, usually one of the two or three currently being hyped... and ask it to create a short document that teaches Java, Python, or Rust in 30 to 60 minutes, complete with code examples. Then I ask the same model to review the artifact it just produced for correctness and best practices.
What happens next is remarkably consistent. The model produces a glowing review, confidently declaring the document “production ready”… while the code either does not compile, contains obvious bugs, or relies on outright bad practices.
When I point this out, the model apologizes profusely and generates a “fixed” version which still contains errors. I rinse and repeat until I give up.
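For concreteness, the first two steps boil down to something like this. It is only a rough sketch: it assumes the official OpenAI Python SDK, and the model name and prompt wording are placeholders rather than the exact ones I use.

    from openai import OpenAI  # assumes the official OpenAI Python SDK (openai >= 1.0)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder model name; swap in whichever model is currently being hyped.
    MODEL = "gpt-4o"

    def ask_model(prompt: str) -> str:
        """Send a single user prompt and return the model's text reply."""
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Step 1: have the model write the tutorial.
    document = ask_model(
        "Write a short document that teaches Python in 30 to 60 minutes, "
        "complete with code examples."
    )

    # Step 2: have the same model review its own artifact.
    review = ask_model(
        "Review the following document for correctness and best practices:\n\n"
        + document
    )

    print(review)

Pointing out the errors is then just a third ask_model call with the review and my corrections pasted in, which is where the rinse-and-repeat loop starts.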
This is still true today, including with models like Opus 4.5 and ChatGPT 5.2. So whenever I read comments about these models being historic breakthroughs, I can’t help but imagine they are mostly coming from teams proudly generating technical debt at 100× the usual speed.
Things get even worse when you ask the model to review a cloud architecture...
OK, but if you wrote some massive corpus of code with no testing, it probably would not compile either.
I think that if you want to make this a useful experiment, you should use one of the coding assistants that can test and iterate on its code, not some chatbot that is optimized to impress nontechnical people while being as cheap as possible to run.