Comment by antirez 3 days ago

Why I do not believe this shows Anthropic serves folks a worse model:

1. The percentage drop is too low and oscillating: it goes up and down.

2. The baseline of Sonnet 4.5 (the obvious choice for when their GPUs are busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but if that were happening we would likely see a much sharper decline on certain days / periods: the graph would look dominated by a "square wave" shape.

3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing; Claude Code asks you for feedback about the session. B) Claude Code itself gets updated, and the exact tool versions the agent can use change. In part it is also the natural variability of token sampling, which makes runs non-deterministic and non-equivalent (sometimes the model makes suboptimal decisions compared to T=0), but this is the price to pay for some variability.
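
As a toy illustration of that sampling point (invented logits, nothing to do with Anthropic's actual decoding stack): at T=0 the same distribution always yields the argmax token, while at T>0 repeated runs on the same prompt can diverge.

    import numpy as np

    def sample_token(logits, temperature, rng):
        # T=0 is greedy argmax; T>0 samples from the softmax distribution
        if temperature == 0:
            return int(np.argmax(logits))
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    logits = np.array([2.0, 1.8, 0.5, -1.0])  # made-up scores for 4 candidate tokens
    rng = np.random.default_rng(0)
    print([sample_token(logits, 0.0, rng) for _ in range(5)])  # always the argmax token
    print([sample_token(logits, 1.0, rng) for _ in range(5)])  # varies run to run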

levkk 3 days ago

I believe the science, but I've been using it daily and it's been getting worse, noticeably.

  • warkdarrior 3 days ago

    Is it possible that your expectations are increasing, not that the model is getting worse?

    • GoatInGrey 3 days ago

      Possible, though you eventually run into types of issues that you recall the model just not having before. Like trouble accessing a database, or not following the SOP you have it read each time it performs X routine task. There are also much less ambiguous patterns, like getting caught in loops or failing to execute a script it wrote after ten attempts.

      • merlindru 3 days ago

        yes but i keep wondering if that's just the game of chance doing its thing

        like these models are nondeterministic right? (besides the fact that rng things like top k selection and temperature exist)

        say with every prompt there's a 2% chance the AI gets it massively wrong. what if i had just lucked out the past couple weeks and now i had a streak of bad luck?

        and since my expectations are based on its previous (lucky) performance i now judge it even though it isn't different?

        or is it giving you consistently worse performance, not able to get it right even after clearing context and trying again, on the exact same problem etc?
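
        a quick toy simulation of that (the 2% is a made-up number, just showing how streaks fall out of a constant per-prompt failure rate):

            import random

            random.seed(1)
            p_bad = 0.02             # assumed per-prompt failure rate (made up)
            weeks, prompts = 52, 100

            # count weeks that look at least twice as bad as the baseline,
            # even though the "model" never changes
            bad_weeks = sum(
                sum(random.random() < p_bad for _ in range(prompts)) >= 2 * p_bad * prompts
                for _ in range(weeks)
            )
            print(f"{bad_weeks}/{weeks} weeks look twice as bad purely by chance")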

    • F7F7F7 3 days ago

      I’ve had Opus struggle on trivial things that Sonnet 3.5 handled with ease.

      It’s not so much that the implementations are bad because the code is bad (the code is bad). It’s that it gets extremely confused and starts to frantically make worse and worse decisions and question itself. Editing multiple files, changing its mind and only fixing one or two. Resetting and overriding multiple batches of commits without so much as a second thought, losing days of work (yes, I’ve learned my lesson).

      It, the model, can’t even reason about the decisions it’s making from turn to turn. And the more opaque agentic help it’s getting, the more I suspect that tasks are being routed to much lesser models (not the ones we’ve chosen via /model or those in our agent definitions), however Anthropic chooses.

      In these moments I might as well be using Haiku.

  • davidee 3 days ago

    I have to concur. And to the question about understanding what it's good and bad at: no, tasks that it could accomplish quickly and easily just a month ago now require more detailed prompting and constant "erroneous direction correction."

    It's almost as if, as tool use and planning capabilities have expanded, Claude (as a singular product) is having a harder time coming up with simple approaches that just work, instead trying to use tools and patterns that complicate things substantially and introduce much more room for errors/errors of assumption.

    It also regularly forgets its guidelines now.

    I can't tell you how many times it's suggested significant changes/refactors to functions because it suddenly forgets we're working in an FP codebase and suggests inappropriate imperative solutions as "better" (often choosing to use language around clarity/consistency when the solutions are neither).

    Additionally, it has started taking "initiative" in ways it did not before, attempting to be helpful but without gathering the context needed to do so properly when stepping outside the instruction set. It just ends up being much messier and inaccurate.

    I have to regularly just clear my prompt and start again with guardrails that either have already been established, or were not needed previously / are only a result of the over-zealousness of the work it's attempting to complete.

    • conception 3 days ago

      I assume that after any compacting of the context window, the session is more or less useless at that point. I’ve never had consistent results after compacting.

      • justinlivi 3 days ago

        Compacting equals death of the session in my process. I do everything I can to avoid hitting it. If I accidentally fly too close to the sun and compact I tend to revert and start fresh. As soon as it compacts it's basically useless

    • F7F7F7 3 days ago

      Multiple concurrences: a choir or a mob?

      From 1pm EST it’s all downhill until around 8 or 9pm EST.

      Late nights and weekends are smooth sailing.

  • bushbaba 2 days ago

    I’m finding Gemini and the ChatGPT web terminal outperform Claude Code. The context becomes too much for the LLM, and it tries to make up for it by doing more file read ops.

    • samusiam 2 days ago

      Sounds like you might want to refactor the code if the individual files are too big and it can't find what it's looking for?

  • emp17344 3 days ago

    Any chance you’re just learning more about what the model is and is not useful for?

    • jerf 3 days ago

      I dunno about everyone else but when I learn more about what a model is and is not useful for, my subjective experience improves, not degrades.

      • emp17344 3 days ago

        Not when the product is marketed as a panacea.

    • data-ottawa 3 days ago

      There are some days where it acts staggeringly bad, beyond baselines.

      But it’s impossible to actually determine if it’s model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and does it perform worse?), system prompt and tool changes, fine-tunes and A/B tests, variance in top-p selection…

      There’s too many variables and no hard evidence shared by Anthropic.

    • acuozzo 3 days ago

      No, because switching to the API with the same prompt immediately fixes it.

      There's little incentive to throttle the API. It's $/token.

TIPSIO 3 days ago

I too think A/B testing is the prime suspect: context window limits, system prompts, MAYBE some other questionable things that should be disclosed.

Either way, if true, given the cost I wish I could opt out, or that it were more transparent.

Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.

All speculation though

  • F7F7F7 3 days ago

    Whenever I see new behaviors and suspect I’m being tested on I’ll typically see a feedback form at some point in that session. Well, that and dropping four letter words.

    I know it’s more random sampling than not. But they are definitely using our codebases (and in some respects our livelihoods) as their guinea pigs.

  • samusiam 2 days ago

    If that's the case, then as a benchmark operator you'd want to run the benchmark through multiple different accounts on different machines to average over A/B test noise.
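
    A sketch of the kind of aggregation I mean, with invented account names and pass rates:

        from statistics import mean

        # daily pass rates from the same suite run through different accounts,
        # which may land in different A/B buckets (all numbers invented)
        runs = {
            "account_a": [0.78, 0.80, 0.74, 0.79],
            "account_b": [0.70, 0.69, 0.72, 0.71],  # possibly a different bucket
            "account_c": [0.77, 0.76, 0.78, 0.75],
        }

        per_account = {name: mean(scores) for name, scores in runs.items()}
        pooled = mean(s for scores in runs.values() for s in scores)
        print(per_account)                      # spread hints at different A/B assignments
        print(f"pooled estimate: {pooled:.3f}")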

make3 3 days ago

It would be very easy for them to turn the various (compute) cost vs performance knobs down depending on load to maintain a certain latency; you would see oscillations like this, especially if the benchmark is not always run exactly at the same time every day.

& it would be easy for them to start with a very costly inference setup for a marketing / reputation boost, and slowly turn the knobs down (smaller model, more quantized model, less thinking time, fewer MoE experts, etc)
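
A toy model of that: one quality knob that dips under peak load, plus a benchmark whose run time drifts by a few hours, already produces day-to-day "oscillation" (all numbers invented, not a claim about Anthropic's setup):

    import math
    import random

    random.seed(0)

    def quality(hour):
        # hypothetical knob: quality dips around peak-load hours (invented curve)
        peak_load = math.exp(-((hour - 17) ** 2) / 18)  # load peaks around 17:00
        return 0.80 - 0.06 * peak_load                  # knob turned down under load

    # the benchmark nominally runs at 14:00 but drifts a few hours day to day
    for day in range(7):
        hour = 14 + random.uniform(-3, 3)
        score = quality(hour) + random.gauss(0, 0.01)   # plus ordinary run-to-run noise
        print(f"day {day}: run at {hour:4.1f}h -> score {score:.3f}")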

littlestymaar 3 days ago

> 1. The percentage drop is too low and oscillating, it goes up and down.

How do you define “too low”? They make sure to communicate the statistical significance of their measurements; what's the point if people can just claim it's “too low” based on personal vibes…
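
For what it's worth, "too low" is checkable rather than a vibe: given pass/fail counts from two periods, a plain two-proportion z-test says whether the drop sits within noise (the counts below are made up):

    from math import sqrt
    from statistics import NormalDist

    def drop_p_value(pass_a, n_a, pass_b, n_b):
        # two-sided p-value for a difference in pass rates between two periods
        p_a, p_b = pass_a / n_a, pass_b / n_b
        pooled = (pass_a + pass_b) / (n_a + n_b)
        se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        return p_a, p_b, 2 * (1 - NormalDist().cdf(abs(z)))

    # made-up counts: 400 benchmark tasks per period
    p_a, p_b, p = drop_p_value(312, 400, 288, 400)
    print(f"pass rate {p_a:.2%} -> {p_b:.2%}, p = {p:.3f}")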

eterm 3 days ago

4. The graph starts January 8.

Why January 8? Was that an outlier high point?

IIRC, Opus 4.5 was released in late November.

  • F7F7F7 3 days ago

    Right after the holiday double-token promotion, users felt (perceived) a huge regression in capabilities. I bet that triggered the idea.

  • pertymcpert 3 days ago

    People were away for the holidays. What do you want them to do?

  • littlestymaar 3 days ago

    Or maybe, just maybe, that's when they started testing…

    • eterm 3 days ago

      The Wayback Machine has nothing for this site before today, and the article is "last updated Jan 29".

      A benchmark like this ought to start fresh from when it is published.

      I don't entirely doubt the degradation, but the choice of where they went back to feels a bit cherry-picked to demonstrate the value of the benchmark.

      • littlestymaar 3 days ago

        Which makes sense; you gotta wait until you have enough data before you can communicate on said data…

        If anything, it's consistent with the fact that they very likely didn't have data earlier than January 8th.