Comment by refulgentis 2 days ago

I write a local LLM client, but sometimes I hate that local models have enough knobs to turn that people can argue they're reasonable in any scenario. In yesterday's post re: Kimi K2, multiple people spoke up to say you can "just" stream the active expert weights in and out of 64 GB of RAM, use the lowest GGUF quant, and get something that rounds to 1 token/s, and that that is reasonable for use.
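
For a rough sense of why that rounds to 1 token/s, here is a back-of-envelope sketch. Every number in it is an assumption, not a measurement: Kimi K2 activates roughly 32B parameters per token, the lowest GGUF quants sit around 2-3 bits per weight, and a consumer NVMe drive reads on the order of 5 GB/s.

```python
# Back-of-envelope: token rate when MoE expert weights must be streamed from
# NVMe because the model doesn't fit in 64 GB of RAM. All figures are guesses.
active_params = 32e9        # ~32B parameters activated per token (Kimi K2)
bits_per_weight = 2.5       # lowest GGUF quants are roughly 2-3 bits/weight
resident_fraction = 0.3     # guess: attention/shared weights kept in RAM
nvme_bytes_per_s = 5e9      # ~5 GB/s sequential read on consumer NVMe

bytes_per_token = active_params * bits_per_weight / 8 * (1 - resident_fraction)
print(f"~{bytes_per_token / 1e9:.0f} GB streamed per token")
print(f"~{nvme_bytes_per_s / bytes_per_token:.1f} tokens/s")
```

That works out to roughly 7 GB read per token and about 0.7 tokens/s, which matches the "rounds to 1 token/s" figure above.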

Good on you for not exaggerating.

I am very curious what exactly they see in that; 2-3 people hopped in to handwave that you just have it do agent stuff overnight and it's well worth it. I can't even begin to imagine how, unless you have a metric **-ton of easily solved problems that aren't coding. Even a 90% success rate gets you into "useless" territory quickly when one step depends on the other and you're running it autonomously for hours.
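
The "useless territory" point is just compounding probability. Assuming each step succeeds independently with probability 0.9 (independence is the assumption here), the chance a whole chain stays on track falls off fast:

```python
# Probability an n-step chain succeeds when every step must succeed and each
# step independently succeeds 90% of the time.
p = 0.9
for n in (1, 5, 10, 20, 50):
    print(f"{n:2d} steps: {p ** n:6.1%} chance the run is still on track")
```

By 20 dependent steps you are at roughly 12%, which is why an unattended overnight run needs either very short chains or a much better per-step success rate.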

segmondy a day ago

I run DeepSeek at 5 tk/sec at home and I'm happy with it; I don't need to do agent stuff to gain from it. I was saving up to eventually build out enough to run it at 10 tk/sec, but with Kimi K2 the plan has changed, and the savings continue with a goal of running it at 5 tk/sec at home.

  • fzzzy a day ago

    I agree, 5 tokens per second is plenty fast for casual use.

    • overfeed a day ago

      Also works perfectly fine in fire-and-forget, non-interactive agentic workflows. My dream scenario is that I create a bunch of kanban tickets and assign them to one or more AI personas[1], and wake up to some pull requests the next morning. I'd be more concerned about tickets-per-day than tk/s, as I have no interest in watching the inner workings of the model.

      1. Some more creative than others, with slightly different injected prompts or perhaps even different models entirely.
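
      A minimal sketch of that ticket-to-PR loop, assuming hypothetical `fetch_open_tickets`, `run_agent`, and `open_pull_request` callables (none of these are real APIs, just the shape of the overnight batch):

      ```python
      # Hypothetical fire-and-forget batch: drain the ticket queue overnight, let
      # each persona attempt its tickets, and open PRs for whatever passes tests.
      PERSONAS = {
          "careful":  {"model": "kimi-k2",  "prompt": "Small, conservative diffs."},
          "creative": {"model": "deepseek", "prompt": "Refactor freely if it helps."},
      }

      def run_overnight(fetch_open_tickets, run_agent, open_pull_request):
          opened = 0
          for ticket in fetch_open_tickets():          # kanban tickets written by a human
              persona = PERSONAS[ticket.get("persona", "careful")]
              result = run_agent(ticket, **persona)    # slow tk/s is fine: nobody is watching
              if result.tests_pass:
                  open_pull_request(ticket, result.diff)
                  opened += 1
          return opened                                # tickets-per-day is the metric that matters
      ```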

      • numpad0 a day ago

        > I create a bunch of kanban tickets and assign them to one or more AI personas[1],

        Yeah, that. Why can't we just `find ./tasks/ | grep '\.md$' | xargs llm`? Can't we just write up a government-proposal-style document and have the LLM recurse down into sub-sub-projects and back up, until the original proposal document can be translated into a completion report? Constantly correcting a humongous LLM with infinite context length that can keep everything in its head doesn't feel like the right approach.
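
        A rough sketch of that recursion, with a hypothetical `llm(prompt) -> str` callable standing in for whatever client you use (the prompts and the depth cap are made up):

        ```python
        from pathlib import Path

        def complete(task_md: str, llm, depth: int = 0, max_depth: int = 3) -> str:
            """Recurse down into sub-tasks, then merge results back up into a report."""
            if depth >= max_depth:
                return llm(f"Do this task and report the outcome:\n{task_md}")
            subtasks = llm(f"Split into independent sub-tasks, one per line:\n{task_md}")
            reports = [complete(sub, llm, depth + 1, max_depth)
                       for sub in subtasks.splitlines() if sub.strip()]
            return llm("Merge these sub-reports into one completion report:\n" + "\n".join(reports))

        def run_all(tasks_dir: str, llm) -> list[str]:
            # roughly `find ./tasks/ | grep '\.md$' | xargs llm`, one file at a time
            return [complete(p.read_text(), llm) for p in sorted(Path(tasks_dir).glob("*.md"))]
        ```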

        • londons_explore a day ago

          In my experience, this sort of thing nearly works... but never quite well enough: errors and misunderstandings build up at every stage, and the output is garbage.

          Maybe with bigger models it'll work well.

    • SV_BubbleTime 17 hours ago

      It was; last year 5 tk/s was reasonable if you wanted to proofread a paragraph or rewrite some bullet points into a PowerPoint slide.

      Now, with agentic coding, thinking models, “chat with my pdf”, or whatever artifacts are being called now, no, I don’t think 5 tk/s is enough.

    • refulgentis a day ago

      Cosign for chat; that's my bar for usable on a mobile phone (and it correlates well with avg. reading speed).
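
      That correlation is easy to sanity-check with the common ~0.75 words-per-token rule of thumb (both that ratio and the 200-300 wpm reading range are ballpark figures):

      ```python
      # 5 tokens/s converted to words per minute via the ~0.75 words/token heuristic
      tokens_per_s = 5
      words_per_token = 0.75
      print(f"{tokens_per_s * words_per_token * 60:.0f} wpm")  # ~225 wpm, within a typical 200-300 wpm reading pace
      ```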