scottyeager 3 hours ago

Here's a neat-looking project that allows using other models with Claude Code: https://github.com/musistudio/claude-code-router

I found it while looking for reports on the best agents to use with K2. The usual suspects (Cline and its forks, Aider, and Zed) should be interesting to test with K2 as well.

martin_ 3 days ago

how do you run a 1T param model at low cost?

  • maven29 3 days ago

    32B active parameters with a single shared expert.
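
    In case "active parameters" is unclear: here is a toy sketch of an MoE layer with top-k routed experts plus an always-on shared expert, showing how the per-token (active) parameter count stays far below the total. The sizes, expert count, and top-k below are made up for illustration, not K2's actual config.

      import numpy as np

      # Toy MoE layer: each token runs top-k routed experts plus one
      # always-on shared expert, so active params << total params.
      rng = np.random.default_rng(0)
      d_model, d_ff, n_experts, top_k = 64, 256, 8, 2

      experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
                  rng.standard_normal((d_ff, d_model)) * 0.02)
                 for _ in range(n_experts)]
      shared = (rng.standard_normal((d_model, d_ff)) * 0.02,
                rng.standard_normal((d_ff, d_model)) * 0.02)
      router = rng.standard_normal((d_model, n_experts)) * 0.02

      def moe_layer(x):                      # x: (d_model,) one token
          logits = x @ router
          idx = np.argsort(logits)[-top_k:]  # pick top-k routed experts
          gates = np.exp(logits[idx]); gates /= gates.sum()
          out = sum(g * (np.maximum(x @ experts[i][0], 0) @ experts[i][1])
                    for g, i in zip(gates, idx))
          w1, w2 = shared                    # shared expert, always active
          return out + np.maximum(x @ w1, 0) @ w2

      y = moe_layer(rng.standard_normal(d_model))

      per_expert = 2 * d_model * d_ff
      print("total  expert params:", (n_experts + 1) * per_expert)
      print("active expert params:", (top_k + 1) * per_expert)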

    • JustFinishedBSG 3 days ago

      This doesn’t change the VRAM usage, only the compute requirements.

      • selfhoster11 3 days ago

        It does not have to be VRAM; it could be system RAM, or even weights streamed from SSD storage. Reportedly, the latter approach achieves around 1 token per second on machines with 64 GB of system RAM.

        R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE actually makes these models practical to run on cheaper hardware, and DeepSeek R1 is more comfortable for me to run than Llama 3 70B for exactly that reason: when a dense model spills out of the GPU, you take a large performance hit.

        If you need to spill into CPU inference, you would much rather be multiplying a (different) set of 32B active weights for each token than the same 70B (or more) every time, simply because the computation takes so long on CPU.
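
        Back-of-envelope on why that matters: CPU decoding speed is roughly bounded by how many weight bytes you have to stream from RAM per token. The constants below (8-bit weights, ~60 GB/s usable system RAM bandwidth) are assumptions for illustration, not measurements.

          # Rough decode-speed ceiling from memory bandwidth alone.
          # Both constants are assumptions, not measurements.
          BYTES_PER_PARAM = 1.0   # ~8-bit quantized weights
          MEM_BW = 60e9           # usable system RAM bandwidth, bytes/s

          def max_tokens_per_s(active_params_billions):
              bytes_per_token = active_params_billions * 1e9 * BYTES_PER_PARAM
              return MEM_BW / bytes_per_token

          print(f"32B-active MoE : ~{max_tokens_per_s(32):.1f} tok/s ceiling")
          print(f"70B dense      : ~{max_tokens_per_s(70):.1f} tok/s ceiling")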

      • maven29 3 days ago

        You can probably run this on CPU if you have a 4090D for prompt processing, since 1TB of DDR4 only comes out to around $600.

        For GPU inference at scale, I think token-level batching is used.
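
        Quick sanity check on the fit (the quantization level and overhead factor below are guesses, not specs):

          # Does a ~1T-parameter model fit in 1 TB of system RAM?
          params = 1.0e12
          bits_per_param = 4.5   # ~Q4-style average, assumption
          overhead = 1.10        # KV cache, activations, OS, rough guess

          weights_gb = params * bits_per_param / 8 / 1e9
          needed_gb = weights_gb * overhead
          budget_gb = 1024       # "1TB of DDR4"

          print(f"weights ~{weights_gb:.0f} GB, ~{needed_gb:.0f} GB with overhead")
          print(f"fits in {budget_gb} GB: {needed_gb < budget_gb}")
          print(f"~${600 / budget_gb:.2f}/GB at the quoted $600")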

kkzz99 3 days ago

According to the benchmarks it's closer to Opus, but I'd venture that's primarily for English and Chinese.