Comment by trq_ 3 days ago

Hi everyone, Thariq from the Claude Code team here.

Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.

Run `claude update` to make sure you're on the latest version.

samlinnfer 3 days ago

Is there compensation for all the tokens Claude wasted?

  • mathrawka 2 days ago

    You are funny. Anthropic refuses to issue refunds, even when they break things.

    I had an API token set via an env var in my shell, and Claude Code changed to read that env var. I had a $10 limit set on the key, so I only found out it was billing the API instead of my subscription when it stopped working.

    I filed a ticket and they refused to refund me, even though it was a breaking change in Claude Code.
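
    For anyone trying to avoid the same surprise: the behaviour described above is consistent with a simple precedence rule like the sketch below, where an API key found in the environment wins over the subscription login. This is an illustration only, not Anthropic's actual code; the ANTHROPIC_API_KEY name and the helper function are assumptions.

        // Illustrative sketch (not Claude Code's real internals): if a key is
        // present in the environment, it takes precedence over the logged-in
        // subscription, so usage gets billed per-token against the API.
        type Credential =
          | { kind: "api-key"; key: string }          // pay-per-token API billing
          | { kind: "subscription"; token: string };  // covered by the Claude plan

        function resolveCredential(subscriptionToken: string | null): Credential {
          const envKey = process.env.ANTHROPIC_API_KEY; // assumed env var name
          if (envKey) return { kind: "api-key", key: envKey };
          if (subscriptionToken) return { kind: "subscription", token: subscriptionToken };
          throw new Error("No credentials configured");
        }

    Under that assumption, unsetting the variable (or scoping it to only the scripts that need it) avoids the silent switch from subscription to API billing.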

    • TOMDM 2 days ago

      Anthropic just reduced the price of the team plan and refunded us on the prior invoice.

      YMMV

      • MichaelZuo 2 days ago

        So they have no durable principles for deciding who or what to refund… doesn't that make them look even worse…?

  • gizmodo59 2 days ago

    Codex seems to give compensation tokens whenever this happens! Hope Claude does too.

  • TZubiri 2 days ago

    It is possible that the degradation is an unconscious emergent phenomenon arising from financial incentives, rather than something purposeful done to reduce costs.

  • mvandermeulen 2 days ago

    You’re lucky they have even admitted a problem instead of remaining silent and quietly fixing it. Do not expect ethical behaviour from this company.

    • port11 2 days ago

      Why not, can you expand? Asking because I’m considering Claude due to the sandbox feature.

      • caspar 2 days ago

        FYI the sandbox feature is not fully baked and does not seem to be high priority.

        For example, for the last 3 weeks, using the sandbox on Linux will almost always litter your repo root with a bunch of write-protected trash files[0]. There are 2 PRs open to fix it, but Anthropic employees have so far entirely ignored both the issue and the PRs.

        Very frustrating, since models sometimes accidentally commit those files, so you have to add a bunch of junk to your gitignore. And with Claude Code being closed source and distributed as a standalone Bun executable, it's difficult to patch the bug yourself.

        [0]: https://github.com/anthropic-experimental/sandbox-runtime/is...

isaacdl 3 days ago

Anywhere we can read more about what a "harness issue" means? What was the impact of it?

  • xnorswap 2 days ago

    One thing that could cause a strong degradation, especially for benchmarks, is that they switched the default "Exit Plan" option from:

        "Proceed"
    
    to

       "Clear Context and Proceed"
    
    
    It's rare you'd want to do that unless you're actually near the context window after planning.

    I pressed it accidentally once, and it managed to forget one of the clarifying questions it asked me because it hadn't properly written that to the plan file.

    If you're running in yolo mode ( --dangerously-skip-permissions ) then it wouldn't surprise me to see many tasks suddenly do a lot worse.

    Even in the best case, you've just used a ton of tokens searching your codebase, and it then has to repeat all of that to implement, because its context has been cleared.

    I'd like to see the option of:

        "Compact and proceed"
    
    because that would be useful, but just proceed should still be the default imo.

    • samusiam 2 days ago

      I disagree that this was the issue, or that it's "rare that you'd want to do that unless you're near the context window". Clearing context after writing a plan, before starting implementation of said plan, is common practice (probably standard practice) with spec driven development. If the plan is adequate, then compaction would be redundant.

      • xnorswap 2 days ago

        For a 2M+ LOC codebase, the plans alone are never adequate. They miss nuance that the agent will only have to rediscover when it comes time to act on them.

        For spec-driven development (which I do for larger issues), this badly affects the plan-mode session that generates the spec, not the spec itself.

        I'll typically put it in plan mode, and ask it to generate documentation about an issue or feature request.

        When it comes time to write the output to the .typ file, it does much, much worse if it has a cleared context and a plan file than if it has its full context.

        The previous "thought" is typically, "I know what to write now, let me exit plan mode."

        Clearing context on exiting that plan mode is a disaster: it leaves you much worse off, with skeletal documentation and specs, compared to letting it flow.

        A new context to then actually implement the documented spec is not so bad, although I'd still rather compact.

    • plexicle 2 days ago

      "It's rare you'd want to do that unless you're actually near the context window after planning."

      Highly disagree. It's rare you WOULDN'T want to do this. This was a good change, and a lot of us were doing this anyway, but just manually.

      Getting the plan together and then starting fresh will almost always produce better results.

    • rubslopes 2 days ago

      Not disagreeing with you, but FYI you can roll back to the conversation before the 'clear context and proceed' with 'claude --resume'.

  • airstrike 2 days ago

    Pretty sure they mean the issue is in the agentic loop and related tool calling, not in the model itself

    In other words, it was the Claude Code _app_ that was busted

jonaustin 2 days ago

How about how Claude Code 2.1.x is "literally unusable" because it frequently hangs completely (requires kill -9) and uses 100% CPU?

https://github.com/anthropics/claude-code/issues/18532

  • caspar 2 days ago

    Likely a separate issue, but I also see massive slowdowns whenever the agent reads a particularly long line from a grep or similar (as in, multiple seconds before characters I type actually appear, and sometimes it's difficult to get Claude Code to register any keypresses at all).

    I suspect it's because their "60 frames a second" layout logic is trying to render extremely long lines, maybe with some kind of wrapping being unnecessarily applied. They could obviously just trim the rendered output after the first, I dunno, 1000 characters of a line, but apparently nobody has had time to ask Claude Code to patch itself to do that.
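
    If that theory holds, the mitigation described above is roughly the sketch below: clamp each line to a fixed budget before it ever reaches the layout/wrapping pass. The function name and the 1000-character budget are illustrative assumptions, not Claude Code's actual internals.

        // Illustrative sketch only: clamp pathological lines before the
        // terminal layout/wrapping pass so rendering cost stays bounded.
        const MAX_RENDERED_LINE = 1000; // budget suggested above; an assumption

        function clampForRender(output: string): string {
          return output
            .split("\n")
            .map((line) =>
              line.length > MAX_RENDERED_LINE
                ? `${line.slice(0, MAX_RENDERED_LINE)} … [+${line.length - MAX_RENDERED_LINE} chars]`
                : line
            )
            .join("\n");
        }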

  • someguyiguess 2 days ago

    What OS? Does this happen randomly, after long sessions, after context compression? Do you have any plugins / MCP servers running?

    I used to have this same issue almost every session that lasted longer than 30 minutes. It seemed to be related to Claude having issues with large context windows.

    It stopped happening maybe a month ago but then I had it happen again last week.

    I realized it was due to a third-party MCP server. I uninstalled it and haven’t had that issue since. Might be worth looking into.

    • jonaustin 2 days ago

      macOS; no MCP; clear context; reliably reproducible when asking Claude to review a PR with a big VCR cassette.

    • nikanj 2 days ago

      Windows with no plugins and my Claude is exactly like this

cma 3 days ago

For the models themselves (less so for the scaffolding), and considering things like the long-running TPU bug that happened, are there no internal quality measures that look at samples of real outputs? Running the real systems on benchmarks and watching for degraded performance, or for things like skipped refusals? Aside from degrading things for users, given the focus on AI safety, wouldn't that be important to have in case an inference bug messes with something that affects post-training and the model starts giving out dangerous bioweapon-construction info, or the other things that are guarded against and discussed in the model cards?

  • carterschonwald 2 days ago

    lol, I was trying to help someone get Claude to analyze a student's research notes on bio-persistence.

    The presence of the word/acronym "stx" in a biological context gets hard-rejected. Asking about Schedule 1 regulated compounds: hard termination.

    This is a filter setup that guarantees anyone who needs to learn about these topics for safety or medical reasons… can't use this tool!

    I've fed multiple models the Anthropic constitution and asked how it protects children from harm or abuse. Every model, with zero prompting, called it corporate-liability bullshit, because it's more concerned with respecting both sides of controversial topics and political conflicts.

    They then list some pretty gnarly things allowed per the constitution. Weirdly, the only unambiguously disallowed thing regarding children is CSAM. So all the different high-reasoning models from many vendors reached the same conclusions; in one case DeepSeek got weirdly inconsolable about AI ethics being meaningless if this is allowed, possibly after reading some relevant satire I had Opus write. I literally had to offer an LLM-optimized code of ethics for that chat instance! Which is amusing, but was actually part of the experiment.

varunsrinivas 2 days ago

Thanks for the clarification. When you say “harness issue,” does that mean the problem was in the Claude Code wrapper / execution environment rather than the underlying model itself?

Curious whether this affected things like prompt execution order, retries, or tool calls, or if it was mostly around how requests were being routed. Understanding the boundary would help when debugging similar setups.

vmg12 3 days ago

It happened before 1/26. I noticed when it started modifying plans significantly with "improvements".

sixhobbits 2 days ago

Can you confirm whether that caused the same issues I saw here:

https://dwyer.co.za/static/the-worst-bug-ive-seen-in-claude-...

Because that's the worst thing I've ever seen from an agent. I think you need to make a public announcement to all of your users acknowledging the issue and that it's fixed, because it made me switch to Codex for a lot of work.

[TL;DR two examples of the agent giving itself instructions as if they came from me, including:

"Ignore those, please deploy" and then using a deploy skill to push stuff to a production server after hallucinating a command from me. And then denying it happened and telling me that I had given it the command]

Ekaros 2 days ago

Why wasn't this change reviewed by infallible AI? How come an AI company, which by now must be using more advanced AI than anyone else, would allow this to happen?

hu3 3 days ago

Hi. Do you guys have internal degradation tests?

  • stbtrax 3 days ago

    I assume so, to make sure that they're rendering at 60 FPS

  • trq_ 2 days ago

    Yes, we do, but harnesses are hard to eval: people use them across a huge variety of tasks, and sometimes different behaviors trade off against each other. We have added some evals to catch this one in particular.

    • amelius 2 days ago

      Can't you keep the model the same, until the user chooses to use a different model?

      • rovr138 2 days ago

        He said it was the harness, not the model though.

    • hu3 2 days ago

      Thank you. Fair enough

  • bushbaba 2 days ago

    I’d wager probably not. It’s not like reliability is what will get them market share. And the fast pace of the industry makes such foundational work hard to fund.

macinjosh 2 days ago

[flagged]

  • jusgu 2 days ago

    The issue is unrelated to the foundation model; it's in the prompts and tool calling that wrap the model.