exitb 3 days ago

An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
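
A rough sketch of those two paths, purely as an illustration (all names, knobs, and thresholds here are made up, not anything Anthropic has documented):

    from dataclasses import dataclass

    # Hypothetical load-shedding logic; illustrative only, not any provider's real serving code.
    @dataclass
    class Request:
        prompt: str
        thinking_budget: int = 8000   # tokens the model may spend reasoning
        precision: str = "bf16"       # weight precision used for inference

    def handle(req: Request, load: float) -> str:
        if load > 0.95:
            return "529 Overloaded"       # the obvious path: refuse the request outright
        if load > 0.80:
            req.thinking_budget //= 2     # the quiet path: less thinking time...
            req.precision = "int8"        # ...and/or a more aggressively quantized model
        return f"served at {req.precision}, thinking budget {req.thinking_budget}"

    print(handle(Request("explain this stack trace"), load=0.85))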

  • codeflo 3 days ago

    This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.

    • TedDallas 3 days ago

      Per Anthropic’s RCA for the September 2025 issues, linked in the OP:

      “… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”

      So, according to Anthropic, they are not tweaking quality settings due to demand.

      • rootnod3 3 days ago

        And according to Google, they always delete data if requested.

        And according to Meta, they always give you ALL the data they have on you when requested.

      • cmrdporcupine 3 days ago

        I guess I just don't know how to square that with my actual experiences then.

        I've seen sporadic drops in reasoning skills that made me feel like it was January 2025, not 2026 ... inconsistent.

      • chrisjj 3 days ago

        That's about model quality. Nothing about output quality.

      • stefan_ 3 days ago

        That's what is called an "overly specific denial". It sounds more palatable if you say "we deployed a newly quantized model of Opus and here are cherry-picked benchmarks to show it's the same", and even that they don't announce publicly.

      • [removed] 3 days ago
        [deleted]
    • mcny 3 days ago

      Personally, I'd rather get queued up for a longer wait. I mean, not ridiculously long, but I'm OK waiting five minutes to get correct, or at least more correct, responses.

      Sure, I'll take a cup of coffee while I wait (:

      • lurking_swe 3 days ago

        i’d wait any amount of time lol.

        at least i would KNOW it’s overloaded and i should use a different model, try again later, or just skip AI assistance for the task altogether.

    • direwolf20 3 days ago

      They don't advertise a certain quality. You take what they have or leave it.

    • bpavuk 3 days ago

      > I think delivering lower quality than what was advertised and benchmarked is borderline fraud

      welcome to Silicon Valley, I guess. everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.

    • denysvitali 3 days ago

      If there's no way to check, then how can you claim it's fraud? :)

    • chrisjj 3 days ago

      There is no level of quality advertised, as far as I can see.

      • pseidemann 3 days ago

        What is "level of quality"? Doesn't this apply to any product?

        • chrisjj 3 days ago

          In this case, it is benchmark performance. See the root post.

  • sh3rl0ck 3 days ago

    I'd wager that lower tok/s vs lower quality of output would be two very different knobs to turn.

awestroke 3 days ago

I've seen some issues with garbage tokens during high load: output that seemed to come from a completely different session, mentioned code I've never seen before, or repeated lines over and over. I suspect Anthropic have some threading bugs or race conditions in their caching/inference code that only show up under very high load.
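
The failure mode I'm thinking of is the classic one where per-request state accidentally becomes shared across requests. A toy sketch of that bug class (nothing to do with Anthropic's actual code, obviously):

    import threading, time

    # A buffer that should be per-request but is accidentally process-global.
    shared_buffer = []

    def generate(session_id, tokens):
        for tok in tokens:
            time.sleep(0.001)                            # yield, letting requests interleave
            shared_buffer.append(f"{session_id}:{tok}")  # wrong scope, no synchronization

    threads = [threading.Thread(target=generate, args=(s, ["tok1", "tok2", "tok3"]))
               for s in ("session_A", "session_B")]
    for t in threads: t.start()
    for t in threads: t.join()

    print(shared_buffer)   # tokens from both "sessions" interleaved in one output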

vidarh 3 days ago

It would happen if they quietly decided to serve up more aggressively distilled / quantised / smaller models when under load.
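
To be concrete about what "more aggressively quantised" means, here is plain int8 weight quantisation as a sketch (a generic technique, nothing specific to Anthropic's setup):

    import numpy as np

    w = np.random.randn(8).astype(np.float32)      # original full-precision weights
    scale = np.abs(w).max() / 127.0                # symmetric int8 scale
    w_int8 = np.round(w / scale).astype(np.int8)   # what the cheaper variant stores
    w_dequant = w_int8.astype(np.float32) * scale  # what inference actually multiplies by

    print(np.abs(w - w_dequant).max())  # small per-weight error, but it is no longer
                                        # exactly the model that was benchmarked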

  • [removed] 3 days ago
    [deleted]
  • chrisjj 3 days ago

    They advertise the Opus 4.5 model. Secretly substituting a cheaper one to save costs would be fraud.

    • vidarh 3 days ago

      If you use the API, you pay for a specific model, yes, but even then there are "workarounds" for them, such as, as someone else pointed out, reducing the amount of time they let it "think".

      If you use the subscriptions, the terms specifically say that beyond the caps they can limit your "model and feature usage, at our discretion".
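
      For reference, on the API both of those knobs are explicit request fields. A minimal sketch assuming the current Python SDK shape (the model id and token numbers are illustrative):

          import anthropic

          client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
          resp = client.messages.create(
              model="claude-opus-4-5",                              # the specific model you pay for
              max_tokens=16000,
              thinking={"type": "enabled", "budget_tokens": 8000},  # the "thinking time" knob
              messages=[{"role": "user", "content": "Refactor this function."}],
          )
          print(resp.content[-1].text)  # last block is the text answer; earlier blocks hold the thinking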

      • chrisjj 3 days ago

        Sure. I was separating the model - which Anthropic promises not to downgrade - and the "thinking time" - which Anthropic doesn't promise not to downgrade. It seems the latter is very likely the culprit in this case.

    • kingstnap 3 days ago

      Old-school Gemini used to do this. It was super obvious because midday the model would go from stupid to completely brain-dead. I have a screenshot of Google's FAQ on my PC from 2024-09-13 that says this (I took it to post to Discord):

      > How do I know which model Gemini is using in its responses?

      > We believe in using the right model for the right task. We use various models at hand for specific tasks based on what we think will provide the best experience.

      • chrisjj 3 days ago

        > We use various models at hand for specific tasks based on what we think will provide the best experience

        ... for Google :)

Wheaties466 3 days ago

from what I understand this can come from the batching of requests.

  • chrisjj 3 days ago

    So, a known bug?

    • embedding-shape 3 days ago

      No. Basically, requests are processed together in batches, and the order they're listed in matters for the results, because the grid of tiles the GPU ultimately processes differs depending on the order the requests came in.

      So if you want batching + determinism, you need the same batch in the same order, which obviously doesn't work when there are N+1 clients instead of just one.
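
      You can see the root cause without a GPU: floating-point addition isn't associative, so changing the reduction order (which is exactly what a different batch composition does) changes the result slightly. Toy example:

          import random

          random.seed(0)
          vals = [random.uniform(-1, 1) for _ in range(100_000)]

          a = sum(vals)          # one "batch order"
          b = sum(sorted(vals))  # the same numbers summed in a different order

          print(a == b)          # typically False
          print(abs(a - b))      # tiny, but can be enough to flip a borderline token choice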

      • chrisjj 3 days ago

        Sure, but how can that lead to increased demand resulting in decreased intelligence? That is the effect we are discussing.