Comment by JustFinishedBSG

This doesn’t change the VRAM usage, only the compute requirements.

It does not have to be VRAM, it could be system RAM, or weights streamed from SSD storage. Reportedly, the latter method achieves around 1 token per second on computers with 64 GB of system RAM.

R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE actually makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me than Llama 3 70B for exactly that reason - if it spills out of the GPU, you take a large performance hit.

If you need to spill into CPU inference, you really want to be multiplying a different set of 32B weights for every token compared to the same 70B (or more) instead, simply because the computation takes so long.

Reply View 7 replies

refulgentis 3 days ago

The amount of people who will be using it at 1 token/sec because there's no better option, and have 64 GB of RAM, is vanishingly small.
IMHO it sets the local LLM community back when we lean on extreme quantization & streaming weights from disk to say something is possible*, because when people try it out, it turns out it's an awful experience.
* the implication being, anything is possible in that scenario

Reply View | 6 replies
- selfhoster11 2 days ago
  
  Good. Vanishingly small is still more than zero. Over time, running such models will become easier too, as people slowly upgrade to better hardware. It's not like there aren't options for the compute-constrained either. There are lots of Chinese models in the 3-32B range, and Gemma 3 is particularly good too.
  I will also point out that having three API-based providers deploying an impractically-large open-weights model beats the pants of having just one. Back in the day, this was called second-sourcing IIRC. With proprietary models, you're at the mercy of one corporation and their Kafkaesque ToS enforcement.
  
  Reply View | 3 replies
  
  refulgentis 2 days ago
  
  You said "Good." then wrote a nice stirring bit about how having a bad experience with a 1T model will force people to try 4B/32B models.
  That seems separate from the post it was replying to, about 1T param models.
  If it is intended to be a reply, it hand waves about how having a bad experience with it will teach them to buy more expensive hardware.
  Is that "Good."?
  The post points out that if people are taught they need an expensive computer to get 1 token/second, much less try it and find out it's a horrible experience (let's talk about prefill), it will turn them off against local LLMs unnecessarily.
  Is that "Good."?
  
  Reply View | 2 replies
- homarp 3 days ago
  
  agentic loop can run all night long. It's just a different way to work: prepare your prompt queue, set it up, check result in the morning, adjust. 'local vibe' in 10h instead of 10mn is still better than 10 days of manual side coding.
  
  Reply View | 1 reply
  
  hereme888 2 days ago
  
  Right on! Especially if its coding abilities are better than Claude 4 Opus. I spent thousands on my PC in anticipation of this rather than to play fancy video games.
  Now, where's that spare SSD...
  
  Reply View | 0 replies

maven29 3 days ago

You can probably run this on CPU if you have a 4090D for prompt processing, since 1TB of DDR4 only comes out to around $600.

For GPU inference at scale, I think token-level batching is used.

Reply View 7 replies

zackangelo 3 days ago

Typically a combination of expert level parallelism and tensor level parallelism is used.
For the big MLP tensors they would be split across GPUs in a cluster. Then for the MoE parts you would spread the experts across the GPUs and route to them based on which experts are active (there would likely be more than one if the batch size is > 1).

Reply View | 0 replies
t1amat 3 days ago

With 32B active parameters it would be ridiculously slow at generation.

Reply View | 5 replies
- selfhoster11 3 days ago
  
  DDR3 workstation here - R1 generates at 1 token per second. In practice, this means that for complex queries, the speed of replying is closer to an email response than a chat message, but this is acceptable to me for confidential queries or queries where I need the model to be steerable. I can always hit the R1 API from a provider instead, if I want to.
  Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
  
  Reply View | 4 replies
  
  CamperBob2 2 days ago
  
  That's pretty good. Are you running the real 600B+ parameter R1, or a distill, though?
  
  Reply View | 3 replies