Comment by anonym29
>Claude code runs into use limitations for everyone at every tier
What do you mean by this? I know plenty of people who no longer hit the limits with Opus 4.5, even on the $100 plan, including people who used to hit the limits on the $200 plan w/ Opus 4 and Opus 4.1.
>The API is too expensive to use and it's _still_ subsidized.
What do you mean by saying the API is subsidized? Anthropic is a private company that isn't required to (and doesn't) publish detailed financial statements. The company operating at a loss doesn't mean all inference operates at a loss; it means the company is spending an enormous amount of money on R&D. The fact that the net loss is shrinking over time suggests that inference is increasingly profitable.

In this business there is an enormous up-front cost to train a model. That model then generates revenue that starts large and gradually diminishes until the model is deprecated. So at any given snapshot in time, large ongoing R&D expenditure on the next model can keep the company's overall net profit negative, even while several, if not most, of the previously trained models have fully recouped their training costs in inference revenue.
It's fairly obvious that the monthly subscriptions are subsidized to gain market share, the same way early Uber rides were, but what indication do you have that the PAYG API is being subsidized? How would total losses have shrunk from $5.6B in 2024 to just $3B in 2025, while ARR grew from ~$1B to ~$7B over the same period (one in which usage of the platform dramatically expanded), if PAYG API inference weren't running at a net profit for the company?
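To make that argument concrete, here's a toy payback calculation in Python. Every number below is a made-up placeholder, not an Anthropic figure; the point is just that a single model can recoup its training cost from inference margin while the company as a whole still posts a loss, because it's simultaneously funding the next training run.

```python
# Toy payback model. ALL figures are hypothetical placeholders.
training_cost = 3.0e8          # one-time training cost (USD, hypothetical)
monthly_inference_revenue = [  # revenue decays as newer models take over
    150e6, 140e6, 120e6, 100e6, 80e6, 60e6, 45e6, 35e6, 25e6, 20e6, 15e6, 10e6
]
gross_margin = 0.5             # fraction left after serving (compute) costs

cumulative = 0.0
for month, revenue in enumerate(monthly_inference_revenue, start=1):
    cumulative += revenue * gross_margin
    if cumulative >= training_cost:
        print(f"training cost recouped in month {month}")
        break
else:
    print(f"not recouped; contribution so far: ${cumulative / 1e9:.2f}B")
```

With these placeholder numbers the model pays for itself in month 6, while the company's books could still show a loss if the next training run costs more than this model's ongoing margin.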
>quadratic attention means LLMs will always cost astronomically more than you expect after running the pilot project
This is only true as long as O(n²) quadratic attention remains the prevailing paradigm. As Qwen3-Next (hybrid linear attention with sparse quadratic layers) and Nemotron 3 Nano (a hybrid Mamba SSM) have shown, modern, performant LLMs don't all need to rely strictly on O(n²) attention. Sure, these aren't frontier models competitive with Opus 4.5, Gemini 3 Pro, or GPT 5.2 xhigh, but they aren't experimental tiny toy models like RWKV or Falcon Mamba that serve as little more than PoCs for alternative architectures, either. Qwen3-Next and Nemotron 3 Nano are solid players in their respective local weight classes.
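For a rough sense of scale, here's a sketch of per-layer cost growth for full quadratic attention (~n²·d) versus a kernelized linear-attention layer (~n·d²). The constants are illustrative only and don't come from any of these models; real kernels differ widely:

```python
# Rough scaling comparison: full quadratic attention vs. a linear-attention
# layer. Constants are illustrative, not measured from any real model.

def full_attention_cost(n: int, d: int = 128) -> float:
    # score matrix plus weighted sum: ~O(n^2 * d) per layer
    return n * n * d

def linear_attention_cost(n: int, d: int = 128) -> float:
    # kernelized/linear attention: ~O(n * d^2) per layer
    return n * d * d

for n in (4_096, 32_768, 262_144):
    ratio = full_attention_cost(n) / linear_attention_cost(n)
    print(f"context {n:>7}: quadratic/linear cost ratio ~ {ratio:,.0f}x")
```

At a 4k-token pilot the gap is a modest 32x; at a 256k-token production context it's 2,048x, which is exactly the "costs more than you expected after the pilot" effect.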
Nemotron 3 is amazing. I get 60 tokens/s on my 128GB Nvidia GB10, and it actually emits some pretty reasonable "smart" content for its size.
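For anyone who wants to reproduce that kind of tokens/s figure, a minimal timing harness against a local OpenAI-compatible server (llama.cpp and vLLM both expose this API) would look something like the following; the endpoint URL and model id are placeholders for however you happen to serve Nemotron locally:

```python
# Quick-and-dirty tokens/s check against a local OpenAI-compatible server.
# The base_url and model id below are placeholders, not official names.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="nemotron-3-nano",  # placeholder: use whatever id your server reports
    messages=[{"role": "user", "content": "Explain linear attention briefly."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")
```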