Comment by chazeon a day ago


Well, the seemingly cheap option comes with significantly degraded performance, particularly for agentic use. Have you tried replacing Claude Code with some locally deployed model, say on a 4090 or 5090? I have. It is not usable.

nylonstrung a day ago

Deepseek and Kimi both have great agentic performance

When used with crush/opencode they are close to Claude performance.

Nothing that runs on a 4090 would compete, but Deepseek on openrouter is still 25x cheaper than claude

  • Aeolun a day ago

    > Deepseek on openrouter is still 25x cheaper than claude

    Is it? Or only when you don’t factor in Claude’s cached context? I’ve consistently found it pointless to use open models because the price of the good ones is so close to Claude’s cached-context pricing that I don’t need them.
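
    Back of the envelope (all numbers below are illustrative assumptions, not current list prices): the headline multiple shrinks a lot once cache reads dominate the input mix.

        # Illustrative sketch only: prices and cache-hit rates are assumptions.
        claude_input, claude_cache_read = 3.00, 0.30  # $/M tokens; ~90% discount on cache reads (assumed)
        deepseek_input = 0.14                         # $/M tokens (assumed)

        def blended_cost(hit_rate, full, cached):
            """Effective $/M input tokens at a given cache-hit rate."""
            return hit_rate * cached + (1 - hit_rate) * full

        for hit in (0.0, 0.5, 0.9):
            claude = blended_cost(hit, claude_input, claude_cache_read)
            print(f"cache hits {hit:.0%}: Claude ${claude:.2f}/M vs Deepseek ${deepseek_input:.2f}/M")
        # At 0% hits the gap is ~21x; at 90% hits Claude input is ~$0.57/M, so the
        # headline multiple collapses to ~4x before you even count output tokens.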

    • joefourier a day ago

      Deepseek via their API also has cached context, although the tokens/s was much lower than Claude when I tried it. But for background agents the price difference makes it absolutely worth it.

    • ewoodrich 21 hours ago

      Yes, if you try using Kilo Code/Cline via OpenRouter, Deepseek/Kimi come out much cheaper than Claude Sonnet 4.5.
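
      For reference, switching between them is just a model-slug change against OpenRouter's OpenAI-compatible endpoint. A minimal sketch (the slug and key placeholder are assumptions; check openrouter.ai/models for current names):

          # Point the OpenAI SDK at OpenRouter and pick a model by slug.
          from openai import OpenAI

          client = OpenAI(
              base_url="https://openrouter.ai/api/v1",
              api_key="sk-or-...",  # your OpenRouter key (placeholder)
          )

          resp = client.chat.completions.create(
              # Slug is an assumption -- Kimi or Claude would just be different slugs.
              model="deepseek/deepseek-chat",
              messages=[{"role": "user", "content": "Explain what this diff changes."}],
          )
          print(resp.choices[0].message.content)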

estsauver a day ago

Well, those cards also have extremely limited VRAM and wouldn't be able to run anything in the ~70B parameter space. (Can you even run 30B?)

Things get a lot easier at lower quantisation and higher parameter counts, and there are a lot of people whose AI jobs are "Extract sentiment from text" or "bin into one of these 5 categories", where that's probably fine.
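
A rough weights-only sketch of why (figures are approximate; KV cache, activations and runtime overhead come on top):

    # Back-of-the-envelope VRAM needed just for the weights.
    def weights_gb(params_billion: float, bits_per_weight: int) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9

    for params in (7, 30, 70):
        for bits in (16, 8, 4):
            print(f"{params}B @ {bits}-bit ~ {weights_gb(params, bits):.0f} GB")
    # A 24 GB 4090 can squeeze in ~30B around 4-bit (~15 GB of weights);
    # 70B is ~35 GB even at 4-bit, so it simply does not fit on one card.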

elif a day ago

Strictly speaking, you have not deployed any model on a 5090 because a 5090 card has never been produced.

And without specifying your quantization level, it's hard to know what you mean by "not usable".

Anyway, if you really wanted to try cheap distilled/quantized models locally, you would be using used Tesla V100s and not 4-year-old single-chip gaming GPUs.

JosephjackJR 20 hours ago

they took the already ridiculous v3.1 terminus model, added this new deepseek sparse attention thing, and suddenly it’s doing 128k context at basically half the inference cost of the old version with no measurable drop in reasoning or multilingual quality. like, imo gold medal level math and code, 100+ languages, all while sipping tokens at 14 cents per million input. that’s stupid cheap. the rl recipe they used this time also seems way more stable. no more endless repetition loops or random language switching you sometimes got with the earlier open models. it just works.

what really got me is how fast the community moved. vllm support landed the same day, huggingface space was up in hours, and people are already fine-tuning it for agent stuff and long document reasoning.

i’ve been playing with it locally and the speed jump on long prompts is night and day. feels like the gap to the closed frontier models just shrank again. anyone else tried it yet?
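
quick napkin math on that 14 cents figure, btw (the per-token price is as quoted above, everything else here is illustrative):

    # Sanity check on the quoted input price; output tokens are priced separately.
    price_per_m_input = 0.14   # USD per million input tokens, as quoted
    full_context = 128_000     # tokens

    print(f"one maxed-out 128k prompt ~ ${full_context / 1e6 * price_per_m_input:.3f}")
    # roughly $0.018 per full-context call, which is why long agentic loops stay cheap.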