Comment by summarity 2 days ago

Reasonable speeds are possible with 4-bit quants on two 512GB Mac Studios (MLX TB4 Ring - see https://x.com/awnihannun/status/1943723599971443134) or even a single-socket Epyc system with >1TB of RAM (about the same real-world memory throughput as the M Ultra). So $20k-ish to play with it.

For real-world speeds though yeah, you'd need serious hardware. This is more of a "deploy your own stamp" model, less a "local" model.
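
A rough back-of-envelope on why those memory sizes are the floor (a minimal sketch; the ~1T total parameter count and the 4-bit bytes-per-weight are assumptions, not measured numbers):

    # Does a 4-bit quant of a ~1T-parameter MoE fit in ~1TB of RAM?
    total_params = 1.0e12      # assumed total parameter count (Kimi K2 class)
    bytes_per_weight = 0.5     # ~4-bit quantization
    overhead = 1.15            # rough allowance for scales, KV cache, buffers

    weights_gb = total_params * bytes_per_weight / 1e9
    print(f"weights ~{weights_gb:.0f} GB, ~{weights_gb * overhead:.0f} GB with overhead")
    # roughly 500-600 GB, hence two 512GB Mac Studios or a >1TB-RAM Epyc box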

wongarsu a day ago

Reasonable speeds are possible if you pay someone else to run it. Right now both NovitaAI and Parasail are running it, both available through OpenRouter and both promising not to store any data. I'm sure the other big model hosts will follow if there's demand.

I may not be able to reasonably run it myself, but at least I can choose who I trust to run it and can have inference pricing determined by a competitive market. According to their benchmarks the model is roughly in a class with Claude 4 Sonnet, yet it already costs less than a third of Sonnet's inference pricing.
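
For anyone going that route, OpenRouter exposes an OpenAI-compatible endpoint, so trying the model is a few lines. A minimal sketch, assuming the moonshotai/kimi-k2 model slug (check the OpenRouter model page for the exact ID and for which providers promise no data retention):

    # Kimi K2 via OpenRouter's OpenAI-compatible chat completions API
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",         # your OpenRouter key
    )
    resp = client.chat.completions.create(
        model="moonshotai/kimi-k2",  # assumed slug; verify on openrouter.ai
        messages=[{"role": "user", "content": "Hello, K2."}],
    )
    print(resp.choices[0].message.content)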

  • winter_blue a day ago

    I’m actually finding Claude 4 Sonnet’s thinking model to be too slow to meet my needs. It literally takes several minutes per query on Cursor.

    So running it locally is the exact opposite of what I’m looking for.

    Rather, I’m willing to pay more, to have it be run on a faster than normal cloud inference machine.

    Anthropic is already too slow.

    Since this model is open source, maybe someone could offer it at a “premium” pay per use price, where the response rate / inference is done a lot faster, with more resources thrown at it.

    • terhechte a day ago

      Anthropic isn't slow. I'm running Claude Max and it's pretty fast. The problem is that Cursor slowed down their responses in order to optimize their costs. At least, that's what a ton of people are experiencing.

    • satvikpendem 21 hours ago

      > It literally takes several minutes per query on Cursor.

      There's your issue. Use Claude Code or the API directly and compare the speeds. Cursor is slowing down requests to keep costs down.

gpm 2 days ago

> or even a single socket Epyc system with >1TB of RAM

How many tokens/second would this likely achieve?

  • chithanh 7 hours ago

    KTransformers now supports Kimi K2 for MoE offloading

    They claim 14 tps for the 4-bit quant on a single socket system with 600 GB RAM and 14 GB GPU memory.
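
    That figure lines up with a bandwidth-bound estimate; a sketch, assuming ~32B active parameters per token and ~400 GB/s of usable single-socket DDR5 bandwidth (both assumptions):

        # Decode is roughly memory-bandwidth-bound: each token has to read the
        # weights of whichever experts it activates.
        active_params = 32e9       # assumed active params per token for this MoE
        bytes_per_weight = 0.5     # 4-bit quant
        bandwidth = 400e9          # assumed usable bytes/s on one Epyc socket

        ceiling_tps = bandwidth / (active_params * bytes_per_weight)
        print(f"~{ceiling_tps:.0f} tok/s theoretical ceiling")
        # ~25 tok/s ceiling, so ~14 tok/s measured is plausible after overheads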

  • [removed] 2 days ago
    [deleted]
  • kachapopopow a day ago

    around 1 token/s by the time you try to do anything useful with it (>10,000 tokens)

refulgentis 2 days ago

I write a local LLM client, but sometimes I hate that local models have enough knobs to turn that people can argue they're reasonable in any scenario - in yesterday's post re: Kimi K2, multiple people spoke up that you can "just" stream the active expert weights out of 64 GB of RAM, use the lowest GGUF quant, and then you get something that rounds to 1 token/s, and that this is reasonable for use.

Good on you for not exaggerating.

I am very curious what exactly they see in that; 2-3 people hopped in to handwave that you just have it do agent stuff overnight and it's well worth it. I can't even begin to imagine unless you have a metric **-ton of easily solved problems that aren't coding. Even a 90% success rate gets you into "useless" territory quickly when one step depends on the other and you're running it autonomously for hours.
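
The compounding-error point is easy to put numbers on. A tiny sketch (the 90% per-step success rate is the figure above; the step counts are illustrative):

    # Chance an agent run survives n dependent steps at a fixed per-step success rate
    p = 0.90
    for n in (5, 10, 20, 50):
        print(f"{n:>2} steps -> {p**n:.1%} chance of a clean run")
    # 5 -> 59%, 10 -> 35%, 20 -> 12%, 50 -> 0.5%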

  • segmondy 2 days ago

    I do DeepSeek at 5 tk/sec at home and I'm happy with it. I don't need to do agent stuff to gain from it. I was saving to eventually build out enough to run it at 10 tk/sec, but with Kimi K2 the plan has changed, and the savings continue with a goal of running it at 5 tk/sec at home.

    • fzzzy 2 days ago

      I agree, 5 tokens per second is plenty fast for casual use.

      • overfeed 2 days ago

        Also works perfectly fine in fire-and-forget, non-interactive agentic workflows. My dream scenario is that I create a bunch of kanban tickets and assign them to one or more AI personas[1], and wake up to some Pull Requests the next morning. I'd be more concerned about tickets-per-day, and not tk/s, as I have no interest in watching the inner workings of the model.

        1. Some more creative than others, with slightly different injected prompts or perhaps even different models entirely.
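
        A minimal sketch of what that overnight loop could look like (everything here is hypothetical: Ticket, fetch_open_tickets, run_agent, and open_pull_request are stand-ins for whatever tracker, local-model agent runner, and git forge you actually use):

            # Hypothetical fire-and-forget loop: the metric is tickets/day, not tk/s
            from dataclasses import dataclass

            @dataclass
            class Ticket:
                id: str
                title: str
                description: str
                persona: str    # slightly different prompt (or model) per AI persona

            def fetch_open_tickets() -> list[Ticket]: ...          # stand-in: your tracker's API
            def run_agent(t: Ticket, branch: str) -> bool: ...     # stand-in: True if tests pass
            def open_pull_request(branch: str, title: str) -> None: ...  # stand-in: your git forge

            def overnight_run() -> None:
                for ticket in fetch_open_tickets():
                    branch = f"agent/{ticket.id}"
                    if run_agent(ticket, branch):
                        open_pull_request(branch, ticket.title)    # review these in the morning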

      • refulgentis 2 days ago

        Cosign for chat, that's my bar for usable on mobile phone (and correlates well with avg. reading speed)

      • SV_BubbleTime a day ago

        It was; last year, 5 tk/s was reasonable if you wanted to proofread a paragraph or rewrite some bullet points into a PowerPoint slide.

        Now, with agentic coding, thinking models, “chat with my PDF”, or whatever artifacts are being called now, no, I don't think 5 tk/s is enough.
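
        Rough numbers on why (the per-task token counts are illustrative guesses, not measurements):

            # Wall-clock time at 5 tok/s for different kinds of requests
            tps = 5
            for task, tokens in [("proofread a paragraph", 300),
                                 ("thinking-model answer", 3_000),
                                 ("multi-step agentic coding turn", 30_000)]:
                print(f"{task}: ~{tokens / tps / 60:.0f} min")
            # ~1 min, ~10 min, ~100 min respectively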