Comment by edude03
I have the same experience despite using claude every day. As a funny anecdote:
Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong (fine, it happens), but worse, the 30 or so tests they added put 10 minutes on the test run time, and they all essentially amounted to `expect(true).to.be(true)`, because the LLM had worked around the code not working in the tests.
There was an article on HN last week (?) which described this exact behaviour in the newer models.
Older, less "capable" models would simply fail to accomplish the task. Newer models would cheat, providing a worthless but apparently functional solution.
Hopefully someone with a larger context window than myself can recall the article in question.