Comment by tedivm 2 days ago

If you're trying to build AI-based applications, you can and should compare the cost of vendor-hosted solutions against hosting open models on your own hardware.

On the hardware side, you can run some benchmarks yourself (or use other people's) to get an idea of the tokens/second a given machine can produce. Normalize that for your usage pattern (and do your best to implement batch processing where you can, which saves money with either approach) and you have a rough cost per token.

Then you compare that to the cost of something like GPT-5, which is simpler because the cost per million tokens is something you can grab straight off the pricing page.
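
For example, a rough back-of-the-envelope in Python; every number below (hardware price, power draw, throughput, API prices) is a placeholder you'd swap for your own benchmark results and the vendor's published pricing:

```python
# Rough cost-per-million-tokens comparison: self-hosted vs. API.
# Every number here is a placeholder -- plug in your own benchmark
# results and the vendor's current price sheet.

HOURS_PER_MONTH = 730

# --- Self-hosted estimate ---
hw_cost = 30_000.0          # server + GPUs, amortized over...
amortization_months = 36    # ...three years
power_kw = 2.0              # average draw under load
power_cost_per_kwh = 0.15
tokens_per_second = 2_500   # measured with your batch size / usage pattern
utilization = 0.5           # fraction of the month the GPUs are actually busy

monthly_hw = hw_cost / amortization_months
monthly_power = power_kw * power_cost_per_kwh * HOURS_PER_MONTH
monthly_tokens = tokens_per_second * 3600 * HOURS_PER_MONTH * utilization

self_hosted_per_million = (monthly_hw + monthly_power) / (monthly_tokens / 1e6)

# --- API estimate (blend input/output prices for your workload) ---
api_input_per_million = 1.25    # placeholder, check the pricing page
api_output_per_million = 10.0   # placeholder
input_fraction = 0.8            # share of your tokens that are input

api_per_million = (api_input_per_million * input_fraction
                   + api_output_per_million * (1 - input_fraction))

print(f"self-hosted: ${self_hosted_per_million:.2f} / 1M tokens")
print(f"API:         ${api_per_million:.2f} / 1M tokens")
```

(Staff time, networking, and cooling are left out; add them if they matter for your setup.)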

You'd be surprised how much money running something like DeepSeek (or if you prefer a more established company, Qwen3) will save you over the cloud systems.

That's just one factor though. Another is what hardware you can actually run things on. DeepSeek and Qwen will function on cheap GPUs that other models will simply choke on.

miki123211 a day ago

> with your own hardware

Or with somebody else's.

If you don't have strict data residency requirements, and if you aren't doing this at an extremely large scale, doing it on somebody else's hardware makes much more economic sense.

If you use MoE models (all modern >70B models are MoE), GPU utilization increases with batch size. If you don't have enough requests to keep the GPUs properly fed 24/7, they will end up underutilized.

Sometimes underutilization is okay, if your system needs to be airgapped for example, but that's not an economics discussion any more.

Unlike e.g. video streaming workloads, LLMs can be hosted on the other side of the world from where the user is, and the difference is barely going to be noticeable. This means you can keep GPUs fed by bringing in workloads from other timezones when your cluster would otherwise be idle. Unless you're a large, worldwide organization, that is difficult to do if you're using your own hardware.

  • embedding-shape 20 hours ago

    > If you use MoE models (all modern >70B models are MoE), GPU utilization increases with batch size

    Isn't that true for any LLM, MoE or not? In fact, doesn't that apply to most of ML: as long as batching is possible at all, you can scale it up and use more of the GPU until you saturate some part of the pipeline?

AlexCoventry a day ago

Mixture-of-Experts models benefit from economies of scale because they can process queries in parallel and expect different queries to hit different experts at a given layer. This leads to higher utilization of GPU resources. So unless your application is already getting a lot of use, you're probably underutilizing your hardware.
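
A toy illustration of why batching helps here; this is just random top-k routing over a made-up expert count, not any real model's router, but it shows how larger batches touch more of the loaded expert weights per step:

```python
import random

# Toy model of MoE routing: each token is routed to top_k of n_experts.
# With a small batch most experts sit idle; as the batch grows, more
# experts (and hence more of the loaded weights) do useful work per step.
n_experts = 64
top_k = 4

def fraction_of_experts_active(batch_size, trials=200):
    total = 0.0
    for _ in range(trials):
        active = set()
        for _ in range(batch_size):
            active.update(random.sample(range(n_experts), top_k))
        total += len(active) / n_experts
    return total / trials

for batch in (1, 4, 16, 64, 256):
    print(f"batch={batch:4d}  ~{fraction_of_experts_active(batch):.0%} of experts active")
```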

Muromec 2 days ago

>That's just one factor though. Another is what hardware you can actually run things on. DeepSeek and Qwen will function on cheap GPUs that other models will simply choke on.

What's cheap nowadays? I'm out of the loop. Does anything run on the integrated AMD Ryzen AI chips that come on Framework motherboards? Is under $1k American cheap?

  • GTP a day ago

    Not really in the loop either, but when DeepSeek R1 was released, I stumbled upon this YouTube channel [1] that makes local AI PC builds in the $1,000-2,000 range. But he doesn't always use GPUs; maybe the cheaper builds were CPU plus a lot of RAM, I don't remember.

    [1] https://youtube.com/@digitalspaceport?si=NrZL7MNu80vvAshx

    • District5524 a day ago

      Digital Spaceport is a really good channel, I second that; the author doesn't spare any detail. The cheaper options always use CPU only, or sharding across different cheap GPUs (without SLI/switching), which is not good for all use cases (he highlights this too). But some of his prices are one-off bargains for used gear. And RAM prices doubled this year, so you won't buy 2x256 GB DDR4 for $336 no matter what: https://digitalspaceport.com/500-deepseek-r1-671b-local-ai-s...

    • baq a day ago

      'lots of RAM' got expensive lately -_-

chazeon a day ago

Well, the seemingly cheap option comes with significantly degraded performance, particularly for agentic use. Have you tried replacing Claude Code with some locally deployed model, say on a 4090 or 5090? I have. It is not usable.

  • nylonstrung a day ago

    Deepseek and Kimi both have great agentic performance.

    When used with crush/opencode they get close to Claude's performance.

    Nothing that runs on a 4090 would compete, but Deepseek on openrouter is still 25x cheaper than claude.

    • Aeolun a day ago

      > Deepseek on openrouter is still 25x cheaper than claude

      Is it? Or only when you don’t factor in Claude cached context? I’ve consistently found it pointless to use open models because the price of the good ones is so close to cached context on Claude that I don’t need them.

      • joefourier a day ago

        Deepseek via their API also has cached context, although the tokens/s was much lower than Claude's when I tried it. But for background agents the price difference makes it absolutely worth it.

      • ewoodrich 21 hours ago

        Yes, if you try using Kilo Code/Cline via OpenRouter, Deepseek/Kimi will come out much cheaper than Claude Sonnet 4.5.

  • estsauver a day ago

    Well, those cards also have extremely limited VRAM and wouldn't be able to run anything in the ~70B-parameter range. (Can you even run 30B?)

    Things get a lot easier at lower quantization and higher parameter counts, and there are a lot of people whose AI jobs are "extract sentiment from text" or "bin into one of these 5 categories", where that's probably fine.
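
    For that kind of job, a sketch like this against a small quantized model behind a local OpenAI-compatible server (llama.cpp, vLLM, Ollama, etc.) is usually enough; the endpoint URL, model name, and labels below are placeholders:

    ```python
    import requests

    # Classify text into one of five bins using a small local model served
    # behind an OpenAI-compatible endpoint. URL and model name are placeholders.
    LOCAL_API = "http://localhost:8000/v1/chat/completions"
    MODEL = "qwen3-8b-q4"  # whatever small quantized model you have loaded
    LABELS = ["billing", "bug report", "feature request", "praise", "other"]

    def classify(text: str) -> str:
        prompt = (
            "Classify the following message into exactly one of these categories: "
            + ", ".join(LABELS)
            + ". Reply with the category name only.\n\n"
            + text
        )
        resp = requests.post(LOCAL_API, json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        }, timeout=60)
        resp.raise_for_status()
        answer = resp.json()["choices"][0]["message"]["content"].strip().lower()
        # Fall back to "other" if the model goes off-script.
        return next((label for label in LABELS if label in answer), "other")

    print(classify("The app crashes every time I open settings."))
    ```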

  • elif a day ago

    Strictly speaking, you have not deployed any model on a 5090 because a 5090 card has never been produced.

    And without specifying your quantization level it's hard to know what you mean by "not usable"

    Anyway, if you really wanted to try cheap distilled/quantized models locally, you would be using used Tesla V100s, not 4-year-old single-chip gaming GPUs.

  • JosephjackJR 19 hours ago

    They took the already ridiculous V3.1 Terminus model, added this new DeepSeek sparse attention thing, and suddenly it's doing 128k context at basically half the inference cost of the old version with no measurable drop in reasoning or multilingual quality. Like, IMO-gold-medal-level math and code, 100+ languages, all while sipping tokens at 14 cents per million input. That's stupid cheap. The RL recipe they used this time also seems way more stable: no more endless repetition loops or random language switching you sometimes got with the earlier open models. It just works.

    What really got me is how fast the community moved. vLLM support landed the same day, a Hugging Face space was up in hours, and people are already fine-tuning it for agent stuff and long-document reasoning. I've been playing with it locally and the speed jump on long prompts is night and day. Feels like the gap to the closed frontier models just shrank again. Anyone else tried it yet?

kmacdough 20 hours ago

Furthermore, paid models are heavily subsidized by bullish investors playing for monopoly. So that tips the scales further towards Deepseek.

qeternity 2 days ago

> DeepSeek and Qwen will function on cheap GPUs that other models will simply choke on.

Uh, DeepSeek will not (unless you are referring to one of their older R1 fine-tuned variants). Any flagship DeepSeek model will require something like 16x A100/H100+ GPUs with NVLink, even in FP8.