Comment by websiteapi 2 days ago

I get tempted to buy a couple of these, but I just feel like the amortization doesn’t make sense yet. Surely in the next few years this will be orders of magnitude cheaper.

NitpickLawyer 2 days ago

Before committing to buying two of these, look at the true speeds, which few people post, not just the "it works" reports. We're at a point where we can run these very large models "at home", and that is great! But real usage now involves very large contexts, both for prompt processing and token generation. Whatever speeds these models get at "0" context are very different from what they get at "useful" context, especially for coding and the like.
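
For a sense of what measuring that looks like, here is a minimal sketch that times prompt processing (via time to first token) and generation speed at increasing context depths against a local OpenAI-compatible server (llama.cpp's llama-server, LM Studio, and similar); the endpoint URL, the model name, and the word-based filler are placeholder assumptions:

    # Rough prefill/decode timing at increasing context depths against a local
    # OpenAI-compatible endpoint. Base URL and model name are placeholders.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    MODEL = "local-model"  # whatever name your server exposes

    for depth in (0, 4_000, 16_000, 64_000):
        filler = "lorem " * depth  # crude: roughly one token per short word
        t0 = time.perf_counter()
        first_token_at = None
        n_chunks = 0
        stream = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user",
                       "content": filler + "\nSummarize the above in one line."}],
            max_tokens=128,
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                n_chunks += 1  # approximately one token per streamed chunk
        total = time.perf_counter() - t0
        prefill = (first_token_at or t0) - t0               # ~ prompt processing time
        decode_tps = n_chunks / max(total - prefill, 1e-6)  # ~ generation speed
        print(f"~{depth:>6} tok context: prefill {prefill:6.1f}s, decode {decode_tps:5.1f} tok/s")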

  • solarkraft 2 days ago

    Are there benchmarks that effectively measure this? This is essential information when speccing out an inference system/model size/quantization type.

  • cubefox 2 days ago

    DeepSeek-v3.2 should be better for long context because it uses (near-linear) sparse attention.

stingraycharles 2 days ago

I don’t think it will ever make sense; you can buy so much cloud-based usage for this kind of price.

From my perspective, the biggest problem is that I am just not going to be using it 24/7. Which means I’m not getting nearly as much value out of it as the cloud based vendors do from their hardware.

Last but not least, if I want to run queries against open source models, I prefer to use a provider like Groq or Cerebras as it’s extremely convenient to have the query results nearly instantly.
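
To put rough numbers on that, a back-of-the-envelope break-even sketch; the hardware price, API rates, and usage figures below are illustrative assumptions, not quotes:

    # Back-of-the-envelope: buying local hardware vs. paying per token via an API.
    # Every number below is an illustrative assumption; plug in your own.
    hardware_cost = 19_000.00      # e.g. two high-RAM machines (assumed price)
    api_input_per_mtok = 0.60      # $ per 1M input tokens (assumed)
    api_output_per_mtok = 2.50     # $ per 1M output tokens (assumed)

    # Assumed heavy daily usage for a coding/agent workload.
    daily_input_mtok = 5.0         # 5M prompt tokens per day
    daily_output_mtok = 0.5        # 0.5M generated tokens per day

    daily_api_cost = (daily_input_mtok * api_input_per_mtok
                      + daily_output_mtok * api_output_per_mtok)
    breakeven_days = hardware_cost / daily_api_cost

    print(f"API cost/day: ${daily_api_cost:.2f}")
    print(f"Break-even:   {breakeven_days:,.0f} days (~{breakeven_days / 365:.1f} years)")
    # Ignores electricity and depreciation, and assumes the box would otherwise sit idle.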

  • websiteapi 2 days ago

    my issue is that once you have it in your workflow, you'd be pretty latency sensitive. imagine those record-it-all apps working well; eventually you'd become pretty reliant on it. I don't necessarily want to be at the whims of the cloud

    • stingraycharles 2 days ago

      Aren’t those “record it all” applications implemented as RAG, with snippets injected into the context based on embedding similarity?

      Obviously you’re not going to always inject everything into the context window.
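
      Roughly that pattern: embed each transcript chunk once, then at query time inject only the most similar chunks into the prompt. A minimal sketch using sentence-transformers; the embedding model and the sample chunks are placeholder assumptions:

          import numpy as np
          from sentence_transformers import SentenceTransformer

          embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

          chunks = [  # in practice: transcript segments from the "record it all" pipeline
              "Monday standup: agreed to ship the billing fix by Friday.",
              "Lunch chat about switching the CI runners to ARM.",
              "Call with the vendor about renewing the support contract.",
          ]
          chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

          def retrieve(query: str, k: int = 2) -> list[str]:
              """Return the k chunks most similar to the query (cosine similarity)."""
              q = embedder.encode([query], normalize_embeddings=True)[0]
              scores = chunk_vecs @ q            # cosine similarity: vectors are normalized
              top = np.argsort(scores)[::-1][:k]
              return [chunks[i] for i in top]

          question = "what did we decide about billing?"
          context = "\n".join(retrieve(question))
          prompt = f"Context:\n{context}\n\nQuestion: {question}"
          # `prompt` then goes to whatever local or hosted model does the answering.
          print(prompt)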

  • [removed] 2 days ago
    [deleted]
  • lordswork 2 days ago

    As long as you're willing to wait up to an hour for your GPU to get scheduled when you do want to use it.

    • stingraycharles 2 days ago

      I don’t understand what you’re saying. What’s preventing you from using, e.g., OpenRouter to run a query against Kimi-K2 from whatever provider?
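
      For what it's worth, OpenRouter exposes an OpenAI-compatible endpoint, so a query against a hosted Kimi-K2 looks roughly like the sketch below; the model slug is an assumption, so check the current catalog:

          # Querying Kimi K2 through OpenRouter's OpenAI-compatible API.
          import os
          from openai import OpenAI

          client = OpenAI(
              base_url="https://openrouter.ai/api/v1",
              api_key=os.environ["OPENROUTER_API_KEY"],
          )

          resp = client.chat.completions.create(
              # Assumed slug; OpenRouter routes it to whichever provider is available.
              model="moonshotai/kimi-k2",
              messages=[{"role": "user",
                         "content": "Summarize the trade-offs of local vs. hosted inference."}],
          )
          print(resp.choices[0].message.content)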

      • hu3 2 days ago

        and you'll get a faster model this way

      • bgwalter 2 days ago

        Because you have Cloudflare (MITM 1), OpenRouter (MITM 2), and finally the "AI" provider, all of whom can read, store, analyze, and resell your queries.

        EDIT: Thanks for downvoting what is literally one of the most important reasons for people to use local models. Denying and censoring reality does not prevent the bubble from bursting.

        • irthomasthomas a day ago

          you can use chutes.ai's TEE (Trusted Execution Environment), and Kimi K2 is running at about 100 t/s right now

  • givinguflac 2 days ago

    I think you’re missing the whole point, which is not using cloud compute.

    • stingraycharles 2 days ago

      Because of privacy reasons? Yeah, I’m not going to spend a small fortune on that just to be able to use these types of models.

      • givinguflac 2 days ago

        There are plenty of examples and reasons to do so besides privacy: because one can, because it’s cool, for research, for fine-tuning, etc. I never mentioned privacy. Your use case is not everyone’s.

chrsw 2 days ago

The only reason to run local models is privacy, never cost. Not even latency.

  • websiteapi 2 days ago

    indeed - my main use case is those kinds of "record everything" setups. I'm not even super privacy conscious per se, but it just feels too weird to send literally everything I'm saying all of the time to the cloud.

    luckily for now whisper doesn't require too much compute, but the kind of interesting analysis I'd want would require at least a 1B parameter model, maybe 100B or 1T.
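
    a rough sketch of that pipeline: transcribe locally with openai-whisper, then hand the transcript to whatever larger model does the analysis; the local endpoint, model name, and audio file are placeholder assumptions:

        # "Record everything" sketch: local transcription with openai-whisper,
        # then analysis by a local LLM behind an OpenAI-compatible endpoint.
        # Endpoint URL, model name, and audio file are placeholder assumptions.
        import whisper
        from openai import OpenAI

        stt = whisper.load_model("base")          # small Whisper models need modest compute
        transcript = stt.transcribe("meeting.wav")["text"]

        llm = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
        resp = llm.chat.completions.create(
            model="local-model",
            messages=[
                {"role": "system", "content": "You summarize personal audio transcripts."},
                {"role": "user",
                 "content": f"List key decisions and follow-ups from this transcript:\n\n{transcript}"},
            ],
        )
        print(resp.choices[0].message.content)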

    • nottorp 2 days ago

      > it just feels too weird to send literally everything I'm saying all of the time to the cloud

      ... or your clients' codebases ...

  • andy99 2 days ago

    Autonomy generally, not just privacy. You never know what the future will bring, AI will be enshittified and so will hubs like huggingface. It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

    • Aurornis 2 days ago

      > You never know what the future will bring, AI will be enshittified and so will hubs like huggingface.

      If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

      > It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

      You can pay cloud providers for access to the same models that you can run locally, though. You don’t need a local setup even for this unlikely future scenario where all of the mainstream LLM providers simultaneously decide to make their LLMs poor quality and none of them sees it as a market opportunity to provide good service.

      But even if we ignore all of that and assume that all of the cloud inference everywhere becomes bad at the same time at some point in the future, you would still be better off buying your own inference hardware at that point in time. Spending the money to buy two M3 Ultras right now to prepare for an unlikely future event is illogical.

      The only reason to run local LLMs is if you have privacy requirements or you want to do it as a hobby.

      • CamperBob2 2 days ago

        > If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

        OK. How do we set up this wager?

        I'm not knowledgeable about online gambling or prediction markets, but further enshittification seems like the world's safest bet.

    • chrsw 2 days ago

      Yes, I agree. And you can add security to that too.

alwillis 2 days ago

Hopefully the next time it’s updated, it will ship with some variant of the M5.

amelius 2 days ago

Maybe wait until RAM prices have normalized again.

segmondy 2 days ago

This is a weird line of thinking. Here's a question: if you buy one of these and figure out how to use it to make $100k in 3 months, would that be good? When you run a local model, you shouldn't compare it to the cost of using an API. The value lies in how you use it.

Let's forget about making money. Let's just say you have a weird fetish and like to have dirty sexy conversations with your LLM. How much would you pay for your data not to be leaked and for the world not to see your chat? Perhaps having your own private LLM makes it all worth it.

If you have nothing special going on, then by all means use APIs, but if you feel/know your input is special, then yeah, go private.