Comment by t1amat
With 32B active parameters it would be ridiculously slow at generation.
That's pretty good. Are you running the real 600B+ parameter R1, or a distill, though?
The full thing, 671B. It loses some intelligence at 1.5 bit quantisation, but it's acceptable. I could actually go for around 3 bits if I max out my RAM, but I haven't done that yet.
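As a rough sanity check on the RAM trade-off being described, here is a back-of-envelope estimate of the weight footprint at different average bits per weight. The exact bit-widths are assumptions for illustration; real quantised files (e.g. GGUF) run somewhat larger because some tensors are kept at higher precision and there is metadata overhead.

```python
# Back-of-envelope weight storage for a 671B-parameter model.
# Numbers are approximate; real quantised checkpoints add overhead.
PARAMS = 671e9  # total parameters, per the comment above

def footprint_gb(bits_per_weight: float) -> float:
    """Approximate weight storage in GB at a given average bits/weight."""
    return PARAMS * bits_per_weight / 8 / 1e9

for bpw in (1.58, 3.0, 4.0):
    print(f"{bpw:4.2f} bpw -> ~{footprint_gb(bpw):.0f} GB")
```

At ~1.5 bits the weights fit in roughly 130 GB, while ~3 bits needs around 250 GB, which is consistent with "maxing out my RAM" being the constraint.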
DDR3 workstation here - R1 generates at 1 token per second. In practice that means complex queries get answered on email timescales rather than chat timescales, but that's acceptable to me for confidential queries, or for queries where I need the model to be steerable. I can always hit the R1 API from a provider instead if I want to.
Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
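The estimate above can be checked with simple scaling: if decode speed is roughly inversely proportional to active parameters (a reasonable assumption when generation is memory-bandwidth bound, as it typically is for CPU inference on DDR3), the measured R1 speed scales by the ratio of active parameter counts.

```python
# Back-of-envelope scaling: decode speed assumed inversely proportional
# to active parameters per token (memory-bandwidth-bound regime).
r1_tps = 1.0        # measured R1 speed on the DDR3 workstation, tokens/sec
r1_active = 37e9    # R1 active parameters per token
k2_active = 32e9    # K2 active parameters per token

k2_tps = r1_tps * r1_active / k2_active
print(f"Estimated K2 speed: {k2_tps:.2f} tokens/sec")
```

This gives about 1.16 tokens/second, matching the ~1.15 figure quoted above. It's only a first-order estimate: quantisation level, cache behaviour, and expert routing overhead could shift the real number either way.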