Comment by belter
Every two months, I run a very simple experiment to decide whether I should stop shorting NVDA... Think of it as my personal Pelican on a Bike test. :-)
Here is how it works: I take the latest state-of-the-art model, usually one of the two or three currently being hyped... and ask it to create a short document that teaches Java, Python, or Rust in 30 to 60 minutes, complete with code examples. Then I ask the same model to review the artifact it just produced for correctness and best practices.
What happens next is remarkably consistent. The model produces a glowing review, confidently declaring the document “production ready”… while the code either does not compile, contains obvious bugs, or relies on outright bad practices.
When I point this out, the model apologizes profusely and generates a “fixed” version which still contains errors. I rinse and repeat until I give up.
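For concreteness, the first two steps boil down to something like this. It is only a rough sketch: it assumes the official OpenAI Python SDK, and the model name and prompt wording are placeholders rather than the exact ones I use.

    from openai import OpenAI  # assumes the official OpenAI Python SDK (openai >= 1.0)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder model name; swap in whichever model is currently being hyped.
    MODEL = "gpt-4o"

    def ask_model(prompt: str) -> str:
        """Send a single user prompt and return the model's text reply."""
        response = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Step 1: have the model write the tutorial.
    document = ask_model(
        "Write a short document that teaches Python in 30 to 60 minutes, "
        "complete with code examples."
    )

    # Step 2: have the same model review its own artifact.
    review = ask_model(
        "Review the following document for correctness and best practices:\n\n"
        + document
    )

    print(review)

Pointing out the errors is then just a third ask_model call with the review and my corrections pasted in, which is where the rinse-and-repeat loop starts.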
This is still true today, including with models like Opus 4.5 and ChatGPT 5.2. So whenever I read comments about these models being historic breakthroughs, I can’t help but imagine they are mostly coming from teams proudly generating technical debt at 100× the usual speed.
Things get even worse when you ask the model to review a cloud architecture...
OK, but if you wrote some massive corpus of code with no testing, it probably would not compile either.
I think that if you want to make this a useful experiment, you should use one of the coding assistants that can test and iterate on its code, not some chatbot that is optimized to impress nontechnical people while being as cheap as possible to run.