Comment by maven29
You can probably run this on CPU if you have a 4090D for prompt processing, since 1TB of DDR4 only comes out to around $600.
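For context, a minimal sketch of that setup using the llama-cpp-python bindings - the model path and parameter values are placeholders, not a tested config, and the assumption is a CUDA build where prompt processing gets batched through the GPU even with no layers offloaded:

```python
# Sketch of CPU-resident inference with GPU-assisted prompt processing,
# via llama-cpp-python (pip install llama-cpp-python, built with CUDA).
# Model path and numbers below are placeholders, not a tested config.
from llama_cpp import Llama

llm = Llama(
    model_path="kimi-k2-q4_k_m.gguf",  # hypothetical GGUF quant, lives in DDR4
    n_gpu_layers=0,   # keep all weights in system RAM; with a CUDA build the
                      # GPU still handles the big matmuls of prompt processing
    n_ctx=8192,       # context window
    n_batch=512,      # larger batches push more prompt work onto the GPU
    n_threads=32,     # decode runs on CPU cores, so use all of them
)

out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```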
For GPU inference at scale, I think token-level batching is used.
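To illustrate what token-level (sometimes called continuous) batching means, here's a toy scheduler loop - `model.decode_step` and the request object are hypothetical stand-ins, not any real serving API:

```python
# Toy sketch of token-level (continuous) batching: the scheduler rebuilds the
# batch every decoding step, so new requests join and finished sequences leave
# between tokens instead of waiting for a whole batch to drain.
from collections import deque

def serve(model, incoming: deque, max_batch: int = 32):
    active = []
    while incoming or active:
        # Admit new requests up to the batch limit (between steps, not batches).
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())

        # One forward pass produces exactly one new token per active sequence.
        next_tokens = model.decode_step([r.tokens for r in active])

        still_running = []
        for req, tok in zip(active, next_tokens):
            req.tokens.append(tok)
            if tok == model.eos_token or len(req.tokens) >= req.max_len:
                req.finish()          # stream/return the completed sequence
            else:
                still_running.append(req)
        active = still_running
```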
DDR3 workstation here - R1 generates at 1 token per second. In practice, that means replies to complex queries arrive on email timescales rather than chat timescales, but that's acceptable to me for confidential queries, or queries where I need the model to be steerable. I can always hit the R1 API from a provider instead if I want to.
Given that R1 uses 37B active parameters per token (compared to 32B for K2), and that CPU decode is roughly memory-bandwidth-bound, so speed scales inversely with active parameter count, K2 should be slightly faster than that - around 1.15 tokens/second.
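That back-of-the-envelope falls out directly if you treat decode as purely bandwidth-bound:

```python
# CPU decode is memory-bandwidth-bound, so generation speed scales inversely
# with the bytes of active parameters read per token.
r1_active = 37e9   # R1 active parameters per token (from the thread)
k2_active = 32e9   # K2 active parameters per token (from the thread)
r1_speed = 1.0     # observed tokens/second for R1 on this DDR3 box

k2_speed = r1_speed * r1_active / k2_active
print(f"{k2_speed:.2f} tok/s")   # ~1.16 tok/s, matching the ~1.15 estimate
```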
That's pretty good. Are you running the real 600B+ parameter R1, or a distill, though?
The full thing, 671B. It loses some intelligence at 1.5-bit quantisation, but it's acceptable. I could actually go for around 3 bits if I max out my RAM, but I haven't done that yet.
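For reference, rough weight-only footprints for 671B parameters at various quant widths (this ignores KV cache and runtime overhead, so real usage is higher):

```python
# Approximate weight footprint of 671B parameters at different quant widths.
# Ignores KV cache, activations, and per-tensor overhead.
params = 671e9

for bits in (1.5, 3, 4, 8):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>4} bits -> {gib:,.0f} GiB of weights")
# 1.5 bits -> ~117 GiB; 3 bits -> ~234 GiB; 4 bits -> ~313 GiB; 8 bits -> ~625 GiB
```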
Typically, a combination of expert parallelism and tensor parallelism is used.
The big dense MLP tensors would be split across the GPUs in a cluster (tensor parallelism). For the MoE parts, you would spread the experts across the GPUs and route tokens to them based on which experts are active - likely more than one if the batch size is > 1. See the sketch below.
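A toy sketch of that routing, with round-robin expert placement - the router, experts, and placement rule are illustrative stand-ins, not a real framework API:

```python
# Toy sketch of expert parallelism layered on top of tensor parallelism:
# dense tensors are sharded across all GPUs, while each GPU additionally
# owns a subset of the MoE experts.

NUM_GPUS = 8
NUM_EXPERTS = 64

def expert_home(expert_id: int) -> int:
    # Static round-robin placement: expert e lives on GPU e % NUM_GPUS.
    return expert_id % NUM_GPUS

def moe_forward(tokens, router, experts, top_k=2):
    # Bucket (token, expert) pairs by the GPU hosting each chosen expert.
    buckets = {gpu: [] for gpu in range(NUM_GPUS)}
    for i, tok in enumerate(tokens):
        for expert_id, gate in router(tok, top_k):  # top_k > 1: several experts active
            buckets[expert_home(expert_id)].append((i, expert_id, gate))

    # In a real cluster this dispatch is an all-to-all collective; here we just
    # loop. Each GPU runs only its local experts on the tokens routed to it,
    # and each token's outputs are combined weighted by the gate scores.
    combined = [0.0] * len(tokens)
    for gpu, work in buckets.items():
        for i, expert_id, gate in work:
            combined[i] += gate * experts[expert_id](tokens[i])
    return combined
```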