Comment by antirez
Comment by antirez 3 days ago
Why I do not believe this shows Anthropic serves folks a worse model:
1. The percentage drop is too low and oscillating, it goes up and down.
2. The baseline of Sonnet 4.5 (the obvious choice for when they have GPU busy for the next training) should be established to see Opus at some point goes Sonnet level. This was not done but likely we would see a much sharp decline in certain days / periods. The graph would look like dominated by a "square wave" shape.
3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing, CC asks you feedbacks about the session. B) Claude Code itself gets updated, as the exact tools version the agent can use change. In part it is the natural variability due to the token sampling that makes runs not equivalent (sometimes it makes suboptimal decisions compared to T=0) other than not deterministic, but this is the price to pay to have some variability.
I believe the science, but I've been using it daily and it's been getting worse, noticeably.