Claude Code daily benchmarks for degradation tracking

753 points by qwesr123 3 days ago

trq_ 3 days ago

Hi everyone, Thariq from the Claude Code team here.

Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.

Run `claude update` to make sure you're on the latest version.

Reply View 78 replies

samlinnfer 3 days ago

Is there compensation for the tokens because Claude wasted all of them?

Reply View | 14 replies
- mathrawka 3 days ago
  
  You are funny. Anthropic refuses to issue refunds, even when they break things.
  I had an API token set via an env var on my shell, and claude code changed to read that env var. I had a $10 limit set on it, so found out it was using the API, instead of my subscription, when it stopped working.
  I filed a ticket and they refused to refund me, even though it was a breaking change with claude code.
  
  Reply View | 6 replies
  
  TOMDM 2 days ago
  
  Anthropic just reduced the price of the team plan and refunded us on the prior invoice.
  YMMV
  
  Reply View | 5 replies
- gizmodo59 3 days ago
  
  Codex seems to give compensation tokens whenever this happens! Hope Claude does too.
  
  Reply View | 0 replies
- TZubiri 3 days ago
  
  It is possible that degradation is an unconscious emergent phenomenon that arises from financial incentives, rather than a purposeful degradation to reduce costs.
  
  Reply View | 0 replies
- mvandermeulen 2 days ago
  
  You’re lucky they have even admitted a problem instead of remaining silent and quietly fixing it. Do not expect ethical behaviour from this company.
  
  Reply View | 2 replies
  
  port11 2 days ago
  
  Why not, can you expand? Asking because I’m considering Claude due to the sandbox feature.
  
  Reply View | 1 reply
  
  caspar 2 days ago
  
  FYI the sandbox feature is not fully baked and does not seem to be high priority.
  For example, for the last 3 weeks using the sandbox on Linux will almost-always litter your repo root with a bunch of write-protected trash files[0] - there are 2 PRs open to fix it, but Anthropic employees have so far entirely ignored both the issue and the PRs.
  Very frustrating, since models sometimes accidentally commit those files, so you have to add a bunch of junk to your gitignore. And with claude code being closed source and distributed as a bun standalone executable it's difficult to patch the bug yourself.
  [0]: https://github.com/anthropic-experimental/sandbox-runtime/is...
  
  Reply View | 0 replies
- jonplackett 3 days ago
  
  So quiet…
  
  Reply View | 0 replies
isaacdl 3 days ago

Anywhere we can read more about what a "harness issue" means? What was the impact of it?

Reply View | 6 replies
- xnorswap 2 days ago
  
  One thing that could be a strong degradation especially for benchmarks is they switched the default "Exit Plan" mode from:
  "Proceed"
  to
  "Clear Context and Proceed"
  It's rare you'd want to do that unless you're actually near the context window after planning.
  I pressed it accidentally once, and it managed to forget one of the clarifying questions it asked me because it hadn't properly written that to the plan file.
  If you're running in yolo mode ( --dangerously-skip-permissions ) then it wouldn't surprise me to see many tasks suddenly do a lot worse.
  Even in the best case, you've just used a ton of tokens searching your codebase, and it then has to repeat all that to implement because it's been cleared.
  I'd like to see the option of:
  "Compact and proceed"
  because that would be useful, but just proceed should still be the default imo.
  
  Reply View | 4 replies
  
  samusiam 2 days ago
  
  I disagree that this was the issue, or that it's "rare that you'd want to do that unless you're near the context window". Clearing context after writing a plan, before starting implementation of said plan, is common practice (probably standard practice) with spec driven development. If the plan is adequate, then compaction would be redundant.
  
  Reply View | 1 reply
  
  xnorswap 2 days ago
  
  For a 2M+ LOC codebase, the plans alone are never adequate. They miss nuance that the agent will only have to rediscover when it comes to operate on them.
  For spec driven development (which I do for larger issues), this badly affects the plan to generate the spec, not the spec itself.
  I'll typically put it in plan mode, and ask it to generate documentation about an issue or feature request.
  When it comes to write the output to the .typ file, it does much much worse if it has a cleared context and a plan file than if it has its full context.
  The previous "thought" is typically, "I know what to write now, let me exit plan mode".
  Clearing context on exiting that plan mode is a disaster: it leaves you much worse off, with skeletal documentation and specs compared to letting it flow.
  A new context to then actually implement the documented spec is not so bad, although I'd still rather compact.
  
  Reply View | 0 replies
  
  plexicle 2 days ago
  
  "It's rare you'd want to do that unless you're actually near the context window after planning."
  Highly disagree. It's rare you WOULDN'T want to do this. This was a good change, and a lot of us were doing this anyway, but just manually.
  Getting the plan together and then starting fresh will almost always produce better results.
  
  Reply View | 0 replies
  
  rubslopes 2 days ago
  
  Not disagreeing with you, but FYI you can roll back to the conversation before the 'clear context and proceed' with 'claude --resume'.
  
  Reply View | 0 replies
- airstrike 3 days ago
  
  Pretty sure they mean the issue is on the agentic loop and related tool calling, not on the model itself
  In other words, it was the Claude Code _app_ that was busted
  
  Reply View | 0 replies
jonaustin 3 days ago

How about how Claude 2.1.x is "literally unusable" because it frequently completely hangs (requires kill -9) and uses 100% cpu?
https://github.com/anthropics/claude-code/issues/18532

Reply View | 4 replies
- caspar 2 days ago
  
  Likely a separate issue, but I also have massive slowdowns whenever the agent manages to read a particularly long line from a grep or similar (as in, multiple seconds before characters I type actually appear, and sometimes it's difficult to get claude code to register any keypresses at all).
  Suspect it's because their "60 frames a second" layout logic is trying to render extremely long lines, maybe with some kind of wrapping being unnecessarily applied. Could obviously just trim the rendered output after the first, I dunno, 1000 characters in a line, but apparently nobody has had time to ask claude code to patch itself to do that.
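The trimming idea above can be sketched in a few lines. This is only an illustration of the suggestion, not how Claude Code renders output; `truncate_line`, `truncate_output`, and the 1000-character default are hypothetical names and values taken from the comment's own guess.

```python
def truncate_line(line: str, max_chars: int = 1000, marker: str = "...") -> str:
    """Return the line unchanged if it is short, else its first max_chars
    characters followed by a note saying how much was hidden."""
    if len(line) <= max_chars:
        return line
    hidden = len(line) - max_chars
    return f"{line[:max_chars]}{marker} [{hidden} more chars]"


def truncate_output(text: str, max_chars: int = 1000) -> str:
    """Apply truncate_line to every line of a tool's output before display."""
    return "\n".join(truncate_line(line, max_chars) for line in text.split("\n"))
```

Only the displayed copy would be trimmed this way; the full text could still be passed to the model unchanged.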
  
  Reply View | 0 replies
- someguyiguess 2 days ago
  
  What OS? Does this happen randomly, after long sessions, after context compression? Do you have any plugins / mcp servers running?
  I used to have this same issue almost every session that lasted longer than 30 minutes. It seemed to be related to Claude having issues with large context windows.
  It stopped happening maybe a month ago but then I had it happen again last week.
  I realized it was due to a third-party mcp server. I uninstalled it and haven’t had that issue since. Might be worth looking into.
  
  Reply View | 2 replies
  
  jonaustin 2 days ago
  
  MacOS; no mcp; clear context; reliably reproducible when asking claude to review a pr with a big VCR cassette.
  
  Reply View | 0 replies
  
  nikanj 2 days ago
  
  Windows with no plugins and my Claude is exactly like this
  
  Reply View | 0 replies
cma 3 days ago

For the models themselves, less so for the scaffolding, considering things like the long running TPU bug that happened, are there not internal quality measures looking at samples of real outputs? Using the real systems on benchmarks and looking for degraded perf or things like skipping refusals? Aside from degrading stuff for users, with the focus on AI safety wouldn't that be important to have in case an inference bug messes with something that affects the post training and it starts giving out dangerous bioweapon construction info or the other things that are guarded against and talked about in the model cards?

Reply View | 1 reply
- carterschonwald 2 days ago
  
  lol i was trying to help someone get claude to analyze a student's research notes on bio persistence.
  the presence of the word / acronym stx with biological subtext gets hard rejected. asking about schedule 1 regulated compounds, hard termination.
  this is a filter setup that guarantees anyone who needs to learn about them for safety or medical reasons… cant use this tool!
  ive fed multiple models the anthropic constitution and asked how does it protect children from harm or abuse? every model, with zero prompting, called it corp liability bullshit because they are more concerned with respecting both sides of controversial topics and political conflicts.
  they then list some pretty gnarly things allowed per constitution. weirdly the only unambiguous not allowed thing regarding children is csam. so all the different high reasoning models from many places all reached the same conclusions. in one case deep seek got weirdly inconsolable about ai ethics being meaningless if this is allowed even possibly, after reading some relevant satire i had opus write. i literally had to offer an llm-optimized code of ethics for that chat instance! which is amusing but was actually part of the experiment.
  
  Reply View | 0 replies
varunsrinivas 2 days ago

Thanks for the clarification. When you say “harness issue,” does that mean the problem was in the Claude Code wrapper / execution environment rather than the underlying model itself?
Curious whether this affected things like prompt execution order, retries, or tool calls, or if it was mostly around how requests were being routed. Understanding the boundary would help when debugging similar setups.

Reply View | 0 replies
vmg12 3 days ago

It happened before 1/26. I noticed when it started modifying plans significantly with "improvements".

Reply View | 0 replies
sixhobbits 2 days ago

Can you confirm if that caused the same issues I saw here
https://dwyer.co.za/static/the-worst-bug-ive-seen-in-claude-...
Because that's the worst thing I've ever seen from an agent and I think you need to make a public announcement to all of your users and acknowledge the issue and that it's fixed because it made me switch to codex for a lot of work
[TL;DR two examples of the agent giving itself instructions as if they came from me, including:
"Ignore those, please deploy" and then using a deploy skill to push stuff to a production server after hallucinating a command from me. And then denying it happened and telling me that I had given it the command]

Reply View | 0 replies
Ekaros 2 days ago

Why wasn't this change reviewed by infallible AI? How come an AI company that now must be using more advanced AI than anyone else would allow this to happen?

Reply View | 0 replies
hu3 3 days ago

Hi. Do you guys have internal degradation tests?

Reply View | 41 replies
- stbtrax 3 days ago
  
  I assume so to make sure that they're rendering at 60FPS
  
  Reply View | 32 replies
  
  conception 3 days ago
  
  You joke but having CC open in the terminal hits 10% on my gpu to render the spinning thinking animation for some reason. Switch out of the terminal tab and gpu drops back to zero.
  
  Reply View | 5 replies
  
  reissbaker 3 days ago
  
  Surely you mean 6fps
  
  Reply View | 25 replies
- trq_ 3 days ago
  
  Yes, we do, but harnesses are hard to eval: people use them across a huge variety of tasks, and sometimes different behaviors trade off against each other. We have added some evals to catch this one in particular.
  
  Reply View | 3 replies
  
  amelius 2 days ago
  
  Can't you keep the model the same, until the user chooses to use a different model?
  
  Reply View | 1 reply
  
  rovr138 2 days ago
  
  He said it was the harness, not the model though.
  
  Reply View | 0 replies
  
  hu3 2 days ago
  
  Thank you. Fair enough
  
  Reply View | 0 replies
- bushbaba 2 days ago
  
  I’d wager probably not. It’s not like reliability is what will get them marketshare. And the fast pace of industry makes such foundational tech hard to fund
  
  Reply View | 0 replies
- awestroke 3 days ago
  
  [flagged]
  
  Reply View | 2 replies
  
  dang 3 days ago
  
  Please don't post shallow dismissals or cross into personal attack in HN discussions.
  https://news.ycombinator.com/newsguidelines.html
  
  Reply View | 1 reply
  
  awestroke 2 days ago
  
  Got it, won't happen again
  
  Reply View | 0 replies
macinjosh 3 days ago

[flagged]

Reply View | 1 reply
- jusgu 3 days ago
  
  the issue is unrelated to the foundational model but rather the prompts and tool calling that encapsulate the model
  
  Reply View | 0 replies

ofirpress 3 days ago

[SWE-bench co-author here] It seems like they run this test on a subset of 50 tasks, and that they only run the test once per day. So a lot of the movement in accuracy could be attributed to that. I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score. Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
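A rough sketch of why the sample size matters, assuming each task is an independent pass/fail trial with the same underlying pass rate (a simplification: real tasks differ in difficulty and repeated runs reuse the same tasks). The 60% pass rate is a hypothetical figure, not one taken from the benchmark.

```python
import math


def pass_rate_std_error(p: float, n_tasks: int, n_runs: int = 1) -> float:
    """Standard error of an averaged pass rate under a simple binomial model."""
    return math.sqrt(p * (1 - p) / (n_tasks * n_runs))


p = 0.6  # hypothetical true pass rate
for n_tasks, n_runs in [(50, 1), (300, 1), (300, 5), (300, 10)]:
    se = pass_rate_std_error(p, n_tasks, n_runs)
    print(f"{n_tasks} tasks x {n_runs} runs/day: +/- {100 * se:.1f} points (1 s.e.)")
```

Under these assumptions a single 50-task run has a standard error of roughly 7 percentage points, while 300 tasks averaged over 10 runs brings it below 1 point.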

Reply View 123 replies

Davidzheng 3 days ago

but degradation from servers being overloaded would be the type of degradation this SHOULD measure no? Unless it's only intended for measuring their quietly distilling models (which they claim not to do? idk for certain)

Reply View | 83 replies
- botacode 3 days ago
  
  Load just makes LLMs behave less deterministically and likely degrade. See: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
  They don't have to be malicious operators in this case. It just happens.
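One ingredient often cited for this kind of run-to-run nondeterminism is that floating-point addition is not associative, so the order in which partial sums are combined can change the result slightly. The snippet below shows only that arithmetic fact in plain Python; how (or whether) it plays out in any provider's serving stack is not something this thread establishes.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # sum grouped one way
right = a + (b + c)  # same numbers, different grouping

print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6
```

If the grouping of a reduction depends on how requests happen to be batched together, tiny differences like this can in principle lead to a different sampled token and a diverging completion.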
  
  Reply View | 31 replies
  
  bgirard 3 days ago
  
  > malicious
  It doesn't have to be malicious. If my workflow is to send a prompt once and hopefully accept the result, then degradation matters a lot. If degradation is causing me to silently get worse code output on some of my commits it matters to me.
  I care about -expected- performance when picking which model to use, not optimal benchmark performance.
  
  Reply View | 5 replies
  
  strongpigeon 3 days ago
  
  The question I have now after reading this paper (which was really insightful) is do the models really get worse under load, or do they just have a higher variance? It seems like the latter is what we should expect, not it getting worse, but absent load data we can't really know.
  
  Reply View | 0 replies
  
  altcognito 3 days ago
  
  Explain this though. The code is deterministic, even if it relies on pseudo random number generation. It doesn't just happen, someone has to make a conscious decision to force a different code path (or model) if the system is loaded.
  
  Reply View | 17 replies
  
  stefan_ 3 days ago
  
  The primary (non malicious, non stupid) explanation given here is batching. But I think you would find, looking at large-scale inference, that the batch sizes being run on any given rig are fairly static - there is a sweet spot for any given model part run individually between memory consumption and GPU utilization, and generally GPUs do badly at job parallelism.
  I think the more likely explanation is again with the extremely heterogeneous compute platforms they run on.
  
  Reply View | 4 replies
  
  make3 3 days ago
  
  It's very clearly a cost tradeoff that they control and that should be measured.
  
  Reply View | 0 replies
- samusiam 2 days ago
  
  I'd argue that it depends how that degradation manifests whether you want to include it or not.
  Consider two scenarios: (1) degradation leads to the model being routed behind the scenes to a different server, with subtly different performance characteristics, all unbeknownst to the user; (2) degradation leads to the model refusing a request and returning an "overloaded" message.
  In the first case, absolutely you want to include that because that's the kind of lack of transparency about performance that you'd want signal on. In the second case, an automated test harness might fail, but in the real world the user will just wait and retry when the server is under less load. Maybe you don't include that because it's actually misleading to say that performance (in terms of the model's intelligence, which is how the benchmark will be interpreted) is worse.
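A minimal sketch of how a benchmark harness could separate scenario (2) from genuine failures: retry with backoff on an "overloaded" refusal and report the refusal count next to the score instead of counting it as a failed task. `OverloadedError` and `run_task` are hypothetical stand-ins, not part of any real client library.

```python
import time
from typing import Callable, Optional, Tuple


class OverloadedError(Exception):
    """Hypothetical: raised by run_task when the provider refuses due to load."""


def run_with_retry(
    run_task: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[bool], int]:
    """Return (passed, overloaded_count).

    passed is None when every attempt was refused, so the caller can
    exclude the task from the accuracy score rather than mark it failed.
    """
    overloaded = 0
    for attempt in range(max_attempts):
        try:
            return run_task(), overloaded
        except OverloadedError:
            overloaded += 1
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return None, overloaded
```

Reporting the refusal count separately keeps availability problems visible without folding them into a number that readers will interpret as model quality.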
  
  Reply View | 0 replies
- megabless123 3 days ago
  
  noob question: why would increased demand result in decreased intelligence?
  
  Reply View | 46 replies
  
  exitb 3 days ago
  
  An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
  
  Reply View | 28 replies
  
  awestroke 3 days ago
  
  I've seen some issues with garbage tokens (seemed to come from a completely different session, mentioned code I've never seen before, repeated lines over and over) during high load, suspect anthropic have some threading bugs or race conditions in their caching/inference code that only happen during very high load
  
  Reply View | 0 replies
  
  vidarh 3 days ago
  
  It would happen if they quietly decide to serve up more aggressively distilled / quantised / smaller models when under load.
  
  Reply View | 7 replies
  
  Wheaties466 3 days ago
  
  from what I understand this can come from the batching of requests.
  
  Reply View | 7 replies
- cmrdporcupine 3 days ago
  
  I've personally witnessed large variability in behaviour even within a given session -- which makes sense as there's nothing stopping Anthropic from shuttling your context/session around load balanced through many different servers, some of which might be quantized heavily to manage load and others not at all.
  I don't know if they do this or not, but the nature of the API is such that you could absolutely load balance this way. The context sent at each point is not, I believe, "sticky" to any server.
  TLDR you could get a "stupid" response and then a "smart" response within a single session because of heterogeneous quantization / model behaviour in the cluster.
  
  Reply View | 2 replies
  
  epolanski 3 days ago
  
  I've defended opus in the last weeks but the degradation is tangible. It feels like it degraded by a generation tbh.
  
  Reply View | 1 reply
  
  cmrdporcupine 3 days ago
  
  it's just extremely variable
  
  Reply View | 0 replies
mohsen1 3 days ago

Hope you don't mind the unrelated question:
How do you pay for those SWE-bench runs?
I am trying to run a benchmark but it is too expensive to run enough runs to get a fair comparison.
https://mafia-arena.com

Reply View | 15 replies
- ofirpress 3 days ago
  
  Benchmarks can get costly to run- you can reach out to frontier model creators to try and get them to give you free credits, but usually they'll only agree to that once your benchmark is pretty popular.
  
  Reply View | 14 replies
  
  Dolores12 3 days ago
  
  so basically they know requests using your API key should be treated with care?
  
  Reply View | 6 replies
  
  epolanski 3 days ago
  
  The last thing a proper benchmark should do is reveal its own API key.
  
  Reply View | 5 replies
  
  mohsen1 3 days ago
  
  yes I reached out to them but as you say it's a chicken-and-egg problem.
  Thanks!
  
  Reply View | 0 replies
nikcub 3 days ago

> I would run on 300 tasks and I'd run the test suite 5 or 10 times per day and average that score.
assume this is because of model costs. anthropic could either throw some credits their way (would be worthwhile to dispel the 80 reddit posts a day about degrading models and quantization) or OP could throw up a donation / tip link

Reply View | 3 replies
- simsla 3 days ago
  
  Probably, but with a small sample size like that, they should probably be taking the uncertainty into account, because I wouldn't be surprised if a lot of this variation falls within expected noise.
  E.g. some binomial proportion confidence intervals.
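For illustration, here is the Wilson score interval, one standard choice of binomial proportion confidence interval. It is offered as a sketch of the kind of uncertainty estimate meant above, not as the method the benchmark actually uses; the 30-out-of-50 figure is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(30, 50)  # hypothetical: 30 of 50 tasks passed
print(f"95% interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

With 30 passes out of 50 the interval runs from about 46% to about 72%, so day-to-day swings of several points sit comfortably inside the noise for a sample that small.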
  
  Reply View | 0 replies
- phist_mcgee 3 days ago
  
  Then you'd get people claiming that the benchmarks were 'paid for' by anthropic
  
  Reply View | 1 reply
  
  nikcub 3 days ago
  
  one thing you learn from being on the internet is that you're never going to satisfy everybody
  
  Reply View | 0 replies
seunosewa 3 days ago

The degradation may be more significant within the day than at the same time every day.

Reply View | 1 reply
- GoatInGrey 3 days ago
  
  Sure, but it's still useful insight to see how it performs over time. Of course, cynically, Anthropic could game the benchmark by routing this benchmark's specific prompts to an unadulterated instance of the model.
  
  Reply View | 0 replies
rootnod3 3 days ago

Sorry what?
"You can't measure my Cloud Service's performance correctly if my servers are overloaded"?
"Oh, you just measured me at bad times each day. On only 50 different queries."
So, what does that mean? I have to pick specific times during the day for Claude to code better?
Does Claude Code have office hours basically?

Reply View | 5 replies
- johnsmith1840 3 days ago
  
  This has been happening for years. There's a great paper from microsoft on Deepspeed AI inference.
  Basically the paper showed methods for how to handle heavy traffic load by changing model requirements or routing to different ones. This was awhile ago and I'm sure it's massively more advanced now.
  Also why some of AI's best work for me is early morning and weekends! So yes, the best time to code with modern LLM stacks is when nobody else is. It's also possibly why we go through phases of "they neutered the model" some time after a new release.
  
  Reply View | 0 replies
- kuboble 3 days ago
  
  I wonder if my great experience with claude is partly due to the fact that my working hours don't overlap with the US west coast
  
  Reply View | 0 replies
- swyx 3 days ago
  
  chill out, ofir does not work for anthropic. he's just saying there's inherent variability in LLMs and you need to at least 30x the samples that OP is doing in order to make any form of statistically significant conclusions.
  
  Reply View | 0 replies
- copilot_king 3 days ago
  
  [flagged]
  
  Reply View | 1 reply
  
  rootnod3 3 days ago
  
  Verily, my vichyssoise of verbiage veers most verbose, so let me run that thing out of tokens fast.
  
  Reply View | 0 replies
bhk 3 days ago

According to Anthropic: "We never reduce model quality due to demand, time of day, or server load."
https://www.anthropic.com/engineering/a-postmortem-of-three-...

Reply View | 3 replies
- embedding-shape 3 days ago
  
  They've had issues before with things like "TPU top-k error - Claude sometimes dropped the best next token" (https://www.anthropic.com/engineering/a-postmortem-of-three-...) so what's going on might not be intentional even.
  
  Reply View | 2 replies
  
  mgraczyk 3 days ago
  
  That issue did not have any time of day dependence
  
  Reply View | 0 replies
  
epolanski 3 days ago

Still relevant over time.

Reply View | 0 replies
chrisjj 3 days ago

> Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded.
Are you suggesting result accuracy varies with server load?

Reply View | 0 replies
dana321 3 days ago

"Lots of variance in the score can come from random stuff like even Anthropic's servers being overloaded"
Aha, so the models do degrade under load.

Reply View | 0 replies
cedws 3 days ago

Agreed, this benchmark would be much more useful if run multiple times a day. That could reveal degradation in line with load patterns.

Reply View | 2 replies
- bredren 3 days ago
  
  For CC, I suspect it also needs to be testing and labeling separate runs against subscription, public API and Bedrock-served models?
  It’s a terrific idea to provide this. ~Isitdownorisitjustme for LLMs would be the parakeet in the coalmine that could at least inform the multitude of discussion threads about suspected dips in performance (beyond HN).
  What we could also use is similar stuff for Codex, and eventually Gemini.
  Really, the providers themselves should be running these tests and publishing the data.
  The availability status information is no longer sufficient to gauge the service delivery because it is by nature non-deterministic.
  
  Reply View | 0 replies
- swyx 3 days ago
  
  i recall another project here on HN maybe 4-6 months ago that would run tests 4x a day or something. not sure how to find them again
  
  Reply View | 0 replies
sjtgraham 2 days ago

Why should users care about Anthropic's servers being overloaded?

Reply View | 0 replies

antirez 3 days ago

Why I do not believe this shows Anthropic serves folks a worse model:

1. The percentage drop is too low and oscillating, it goes up and down.

2. The baseline of Sonnet 4.5 (the obvious choice for when they have GPUs busy with the next training run) should be established, to see whether Opus at some point drops to Sonnet level. This was not done, but if it were happening we would likely see a much sharper decline on certain days / periods. The graph would look dominated by a "square wave" shape.

3. There are much better explanations for this oscillation: A) They have multiple checkpoints and are A/B testing; CC asks you for feedback about the session. B) Claude Code itself gets updated, as the exact tool versions the agent can use change. In part it is also the natural variability due to token sampling, which makes runs not equivalent (sometimes it makes suboptimal decisions compared to T=0) as well as not deterministic, but this is the price to pay to have some variability.

Reply View 27 replies

levkk 3 days ago

I believe the science, but I've been using it daily and it's been getting worse, noticeably.

Reply View | 15 replies
- warkdarrior 3 days ago
  
  Is it possible that your expectations are increasing, not that the model is getting worse?
  
  Reply View | 3 replies
  
  GoatInGrey 3 days ago
  
  Possible, though you eventually run into types of issues that you recall the model just not having before. Like accessing a database or not following the SOP you have it read each time it performs X routine task. There are also patterns that are much less ambiguous like getting caught in loops or failing to execute a script it wrote after ten attempts.
  
  Reply View | 1 reply
  
  merlindru 3 days ago
  
  yes but i keep wondering if that's just the game of chance doing its thing
  like these models are nondeterministic right? (besides the fact that rng things like top k selection and temperature exist)
  say with every prompt there is 2% odds the AI gets it massively wrong. what if i had just lucked out the past couple weeks and now i had a streak of bad luck?
  and since my expectations are based on its previous (lucky) performance i now judge it even though it isn't different?
  or is it giving you consistently worse performance, not able to get it right even after clearing context and trying again, on the exact same problem etc?
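The "streak of bad luck" question can be made concrete with a binomial tail probability. The sketch below assumes every prompt fails independently with the same probability, which is a simplification, and uses the hypothetical 2% figure from the comment above.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability of at least k failures in n independent prompts,
    each failing with probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_fail = 0.02  # hypothetical per-prompt chance of a badly wrong answer
print(f"lucky stretch, 0 failures in 100 prompts: {1 - prob_at_least(1, 100, p_fail):.3f}")
print(f"bad stretch, 5+ failures in 100 prompts: {prob_at_least(5, 100, p_fail):.3f}")
```

If both stretches come out as plausible under a fixed failure rate, then a good couple of weeks followed by a bad one is weak evidence that the model itself changed.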
  
  Reply View | 0 replies
  
  F7F7F7 3 days ago
  
  I’ve had Opus struggle on trivial things that Sonnet 3.5 handled with ease.
  It’s not so much that the implementations are bad because the code is bad (the code is bad). It’s that it gets extremely confused and starts to frantically make worse and worse decisions and questioning itself. Editing multiple files, changing its mind and only fixing one or two. Resetting and overriding multiple batches of commits without so much as a second thought and losing days of work (yes, I’ve learned my lesson).
  It, the model, can’t even reason with the decisions it’s making from turn to turn. And the more opaque agentic help it’s getting the more I suspect that tasks are being routed to much lesser models (not the ones we’ve chosen via /model or those in our agent definitions) however Anthropic chooses.
  In these moments I might as well be using Haiku.
  
  Reply View | 0 replies
- davidee 3 days ago
  
  I have to concur. And to the question about understanding what it's good and bad at; no, tasks that it could accomplish quickly and easily just a month ago, now require more detailed prompting and constant "erroneous direction correction."
  It's almost as if, as tool use and planning capabilities have expanded, Claude (as a singular product) is having a harder time coming up with simple approaches that just work, instead trying to use tools and patterns that complicate things substantially and introduce much more room for errors/errors of assumption.
  It also regularly forgets its guidelines now.
  I can't tell you how many times it's suggested significant changes/refactors to functions because it suddenly forgets we're working in an FP codebase and suggests inappropriate imperative solutions as "better" (often choosing to use language around clarity/consistency when the solutions are neither).
  Additionally, it has started taking "initiative" in ways it did not before, attempting to be helpful but without gathering the context needed to do so properly when stepping outside the instruction set. It just ends up being much messier and inaccurate.
  I have to regularly just clear my prompt and start again with guardrails that have either: already been established, or have not been needed previously / are only a result of the over-zealousness of the work it's attempting to complete.
  
  Reply View | 3 replies
  
  conception 3 days ago
  
  I assume that after any compacting of the context window the session is more or less useless. I’ve never had consistent results after compacting.
  
  Reply View | 1 reply
  
  justinlivi 3 days ago
  
  Compacting equals death of the session in my process. I do everything I can to avoid hitting it. If I accidentally fly too close to the sun and compact I tend to revert and start fresh. As soon as it compacts it's basically useless
  
  Reply View | 0 replies
  
  F7F7F7 3 days ago
  
  Multiple concurrences: a choir or a mob?
  From 1pm EST it’s all downhill until around 8 or 9pm EST.
  Late nights and weekends is smooth sailing.
  
  Reply View | 0 replies
- bushbaba 2 days ago
  
  I’m finding Gemini and chatGPT web terminal to outperform Claude code. The context becomes too much for the LLM, and it tries to make up for it by doing more file read ops.
  
  Reply View | 1 reply
  
  samusiam 2 days ago
  
  Sounds like you might want to refactor the code if the individual files are too big and it can't find what it's looking for?
  
  Reply View | 0 replies
- emp17344 3 days ago
  
  Any chance you’re just learning more about what the model is and is not useful for?
  
  Reply View | 4 replies
  
  jerf 3 days ago
  
  I dunno about everyone else but when I learn more about what a model is and is not useful for, my subjective experience improves, not degrades.
  
  Reply View | 1 reply
  
  emp17344 3 days ago
  
  Not when the product is marketed as a panacea.
  
  Reply View | 0 replies
  
  data-ottawa 3 days ago
  
  There are some days where it acts staggeringly bad, beyond baselines.
  But it’s impossible to actually determine if it’s model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and performs worse?), system prompt and tool changes, fine tunes and AB tests, variances in top P selection…
  There’s too many variables and no hard evidence shared by Anthropic.
  
  Reply View | 0 replies
  
  acuozzo 3 days ago
  
  No because switching to the API with the same prompt immediately fixes it.
  There's little incentive to throttle the API. It's $/token.
  
  Reply View | 0 replies
TIPSIO 3 days ago

I too think A/B testing is the prime suspect: context window limits, system prompts, MAYBE some other questionable things that should be disclosed.
Either way, if true, given the cost I wish I could opt out or that it were more transparent.
Put out variants you can select and see which one people flock to. I and many others would probably test constantly and provide detailed feedback.
All speculation though

Reply View | 2 replies
- F7F7F7 3 days ago
  
  Whenever I see new behaviors and suspect I’m being tested on, I’ll typically see a feedback form at some point in that session. Well, that and dropping four-letter words.
  I know it’s more random sampling than not. But they are definitely using our codebases (and in some respects our livelihoods) as their guinea pigs.
  
  Reply View | 0 replies
- samusiam 2 days ago
  
  If that's the case, then as a benchmark operator you'd want to run the benchmark through multiple different accounts on different machines to average over A/B test noise.
  
  Reply View | 0 replies
make3 3 days ago

It would be very easy for them to switch the various (compute) cost vs performance knobs down depending on load to maintain a certain latency; you would see oscillations like this, especially if the benchmark is not always run exactly at the same time every day.
It would also be easy for them to start with a very costly inference setup for a marketing / reputation boost, and slowly turn the knobs down (smaller model, more quantized model, less thinking time, fewer MoE experts, etc).

Reply View | 0 replies
littlestymaar 3 days ago

> 1. The percentage drop is too low and oscillating, it goes up and down.
How do you define “too low”? They make sure to communicate the statistical significance of their measurements; what's the point if people can just claim it's “too low” based on personal vibes…

Reply View | 0 replies
eterm 3 days ago

4. The graph starts January 8.
Why January 8? Was that an outlier high point?
IIRC, Opus 4.5 was released in late November.

Reply View | 5 replies
- F7F7F7 3 days ago
  
  Right after the holiday double-token promotion, users felt (perceived) a huge regression in capabilities. I bet that triggered the idea.
  
  Reply View | 0 replies
- pertymcpert 3 days ago
  
  People were away for the holidays. What do you want them to do?
  
  Reply View | 0 replies
- littlestymaar 3 days ago
  
  Or maybe, just maybe, that's when they started testing…
  
  Reply View | 2 replies
  
  eterm 3 days ago
  
  The Wayback Machine has nothing for this site before today, and the article is "last updated Jan 29".
  A benchmark like this ought to start fresh from when it is published.
  I don't entirely doubt the degradation, but the choice of where they went back to feels a bit cherry-picked to demonstrate the value of the benchmark.
  
  Reply View | 1 reply
  
  littlestymaar 3 days ago
  
  Which makes sense, you gotta wait until you have enough data before you can communicate about said data…
  If anything, it's consistent with the fact that they very likely didn't have data earlier than January 8th.
  
  Reply View | 0 replies

crazygringo 3 days ago

> We model tests as Bernoulli random variables and compute 95% confidence intervals around daily, weekly, and monthly pass rates. Statistically significant differences in any of those time horizons are reported.

They're going to need to provide a lot more detail on their methodology, because that doesn't make a lot of sense. From their graphs, they seem to be calculating the confidence interval around the previous value, then determining whether the new value falls outside of it. But that's not valid for establishing the statistical significance of a difference. You need to calculate the confidence interval of the difference itself, and then see whether that interval excludes 0 (i.e. every value within it has the same sign). This is because both the old and new measurements have uncertainty. Their approach seems to be only considering uncertainty for one of them.
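
To make the distinction concrete, here is a minimal Python sketch. It assumes a simple Wald (normal-approximation) interval, since the benchmark's actual method isn't documented, and the function names and pass counts are made up for illustration. It compares the check the graphs appear to use (is the new pass rate outside the old rate's interval?) with an interval on the difference itself.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def wald_interval(passes, n, z=Z_95):
    """Wald confidence interval for a single Bernoulli pass rate."""
    p = passes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def diff_interval(passes_old, n_old, passes_new, n_new, z=Z_95):
    """Wald confidence interval for (new rate - old rate).

    The standard error combines the variance of BOTH measurements.
    """
    p_old = passes_old / n_old
    p_new = passes_new / n_new
    se = math.sqrt(p_old * (1 - p_old) / n_old + p_new * (1 - p_new) / n_new)
    diff = p_new - p_old
    return diff - z * se, diff + z * se


def diff_excludes_zero(passes_old, n_old, passes_new, n_new, z=Z_95):
    """True if the interval on the difference lies entirely on one side of 0."""
    lo, hi = diff_interval(passes_old, n_old, passes_new, n_new, z)
    return lo > 0 or hi < 0


# Hypothetical counts: 30/50 passes in the old period, 22/50 in the new one.
old_lo, old_hi = wald_interval(30, 50)
new_rate = 22 / 50

# Check 1: does the new rate fall outside the old rate's interval?
print(not (old_lo <= new_rate <= old_hi))  # True

# Check 2: does the interval on the difference exclude 0?
print(diff_excludes_zero(30, 50, 22, 50))  # False
```

With these hypothetical counts, the new rate falls below the old rate's 95% interval, yet the interval on the difference still includes 0, so the drop would not count as significant. For samples this small, a Wilson or Newcombe interval would probably be a better choice than Wald.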

They should also really be more specific about the time periods. E.g. their graphs only show performance over the past 30 days, but presumably the monthly change is comparing the data from 60 to 31 days ago, to the data from 30 days ago until yesterday? In which case the weekly graph really ought to be displaying the past two months, not one month.