Comment by websiteapi 2 days ago

I get tempted to buy a couple of these, but I just feel like the amortization doesn’t make sense yet. Surely in the next few years this will be orders of magnitude cheaper.

NitpickLawyer 2 days ago

Before committing to buying two of these, look at the true speeds, which few people post, not just the "it works" reports. We're at a point where we can run these very large models "at home", and that is great! But real usage now involves very large contexts, both for prompt processing and token generation. Whatever speeds these models get at "0" context are very different from what they get at "useful" context, especially for coding and the like.
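
For a sense of what measuring that looks like, here is a minimal sketch that times prompt processing (via time to first token) and generation speed at increasing context depths against a local OpenAI-compatible server (llama.cpp's llama-server, LM Studio, and similar); the endpoint URL, the model name, and the word-based filler are placeholder assumptions:

    # Rough prefill/decode timing at increasing context depths against a local
    # OpenAI-compatible endpoint. Base URL and model name are placeholders.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    MODEL = "local-model"  # whatever name your server exposes

    for depth in (0, 4_000, 16_000, 64_000):
        filler = "lorem " * depth  # crude: roughly one token per short word
        t0 = time.perf_counter()
        first_token_at = None
        n_chunks = 0
        stream = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user",
                       "content": filler + "\nSummarize the above in one line."}],
            max_tokens=128,
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                n_chunks += 1  # approximately one token per streamed chunk
        total = time.perf_counter() - t0
        prefill = (first_token_at or t0) - t0               # ~ prompt processing time
        decode_tps = n_chunks / max(total - prefill, 1e-6)  # ~ generation speed
        print(f"~{depth:>6} tok context: prefill {prefill:6.1f}s, decode {decode_tps:5.1f} tok/s")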

  • solarkraft 2 days ago

    Are there benchmarks that effectively measure this? This is essential information when speccing out an inference system/model size/quantization type.

  • cubefox 2 days ago

    DeepSeek-v3.2 should be better for long context because it uses (near-linear) sparse attention.

stingraycharles 2 days ago

I don’t think it will ever make sense; you can buy so much cloud-based usage for this kind of price.

From my perspective, the biggest problem is that I am just not going to be using it 24/7. Which means I’m not getting nearly as much value out of it as the cloud based vendors do from their hardware.

Last but not least, if I want to run queries against open source models, I prefer to use a provider like Groq or Cerebras as it’s extremely convenient to have the query results nearly instantly.
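
To put rough numbers on that, a back-of-the-envelope break-even sketch; the hardware price, API rates, and usage figures below are illustrative assumptions, not quotes:

    # Back-of-the-envelope: buying local hardware vs. paying per token via an API.
    # Every number below is an illustrative assumption; plug in your own.
    hardware_cost = 19_000.00      # e.g. two high-RAM machines (assumed price)
    api_input_per_mtok = 0.60      # $ per 1M input tokens (assumed)
    api_output_per_mtok = 2.50     # $ per 1M output tokens (assumed)

    # Assumed heavy daily usage for a coding/agent workload.
    daily_input_mtok = 5.0         # 5M prompt tokens per day
    daily_output_mtok = 0.5        # 0.5M generated tokens per day

    daily_api_cost = (daily_input_mtok * api_input_per_mtok
                      + daily_output_mtok * api_output_per_mtok)
    breakeven_days = hardware_cost / daily_api_cost

    print(f"API cost/day: ${daily_api_cost:.2f}")
    print(f"Break-even:   {breakeven_days:,.0f} days (~{breakeven_days / 365:.1f} years)")
    # Ignores electricity and depreciation, and assumes the box would otherwise sit idle.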

  • websiteapi 2 days ago

    my issue is that once you have it in your workflow, you'd be pretty latency sensitive. imagine those record-it-all apps working well; eventually you'd become pretty reliant on it. I don't necessarily want to be at the whims of the cloud

    • stingraycharles 2 days ago

      Aren’t those “record it all” applications implemented as RAG, with snippets injected into the context based on embedding similarity?

      Obviously you’re not going to always inject everything into the context window.
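
      Roughly that pattern: embed each transcript chunk once, then at query time inject only the most similar chunks into the prompt. A minimal sketch using sentence-transformers; the embedding model and the sample chunks are placeholder assumptions:

          import numpy as np
          from sentence_transformers import SentenceTransformer

          embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

          chunks = [  # in practice: transcript segments from the "record it all" pipeline
              "Monday standup: agreed to ship the billing fix by Friday.",
              "Lunch chat about switching the CI runners to ARM.",
              "Call with the vendor about renewing the support contract.",
          ]
          chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

          def retrieve(query: str, k: int = 2) -> list[str]:
              """Return the k chunks most similar to the query (cosine similarity)."""
              q = embedder.encode([query], normalize_embeddings=True)[0]
              scores = chunk_vecs @ q            # cosine similarity: vectors are normalized
              top = np.argsort(scores)[::-1][:k]
              return [chunks[i] for i in top]

          question = "what did we decide about billing?"
          context = "\n".join(retrieve(question))
          prompt = f"Context:\n{context}\n\nQuestion: {question}"
          # `prompt` then goes to whatever local or hosted model does the answering.
          print(prompt)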

  • [removed] 2 days ago
    [deleted]
  • lordswork 2 days ago

    As long as you're willing to wait up to an hour for your GPU to get scheduled when you do want to use it.

    • stingraycharles 2 days ago

      I don’t understand what you’re saying. What’s preventing you from using, e.g., OpenRouter to run a query against Kimi-K2 from whatever provider?
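
      For what it's worth, OpenRouter exposes an OpenAI-compatible endpoint, so a query against a hosted Kimi-K2 looks roughly like the sketch below; the model slug is an assumption, so check the current catalog:

          # Querying Kimi K2 through OpenRouter's OpenAI-compatible API.
          import os
          from openai import OpenAI

          client = OpenAI(
              base_url="https://openrouter.ai/api/v1",
              api_key=os.environ["OPENROUTER_API_KEY"],
          )

          resp = client.chat.completions.create(
              # Assumed slug; OpenRouter routes it to whichever provider is available.
              model="moonshotai/kimi-k2",
              messages=[{"role": "user",
                         "content": "Summarize the trade-offs of local vs. hosted inference."}],
          )
          print(resp.choices[0].message.content)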

      • hu3 2 days ago

        and you'll get a faster model this way

      • bgwalter 2 days ago

        Because you have Cloudflare (MITM 1), OpenRouter (MITM 2), and finally the "AI" provider, all of whom can read, store, analyze, and resell your queries.

        EDIT: Thanks for downvoting what is literally one of the most important reasons for people to use local models. Denying and censoring reality does not prevent the bubble from bursting.

        • irthomasthomas a day ago

          you can use chutes.ai's TEE (Trusted Execution Environment), and Kimi K2 is running at about 100 t/s right now

  • givinguflac 2 days ago

    I think you’re missing the whole point, which is not using cloud compute.

    • stingraycharles 2 days ago

      Because of privacy reasons? Yeah, I’m not going to spend a small fortune on that just to be able to use these types of models.

      • givinguflac 2 days ago

        There are plenty of examples and reasons to do so besides privacy: because one can, because it’s cool, for research, for fine-tuning, etc. I never mentioned privacy. Your use case is not everyone’s.

chrsw 2 days ago

The only reason to run local models is privacy, never cost. Not even latency.

  • websiteapi 2 days ago

    indeed - my main use case is those kinds of "record everything" setups. I'm not even super privacy conscious per se, but it just feels too weird to send literally everything I'm saying all of the time to the cloud.

    luckily for now whisper doesn't require too much compute, but the kind of interesting analysis I'd want would require at least a 1B parameter model, maybe 100B or 1T.
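
    a rough sketch of that pipeline: transcribe locally with openai-whisper, then hand the transcript to whatever larger model does the analysis; the local endpoint, model name, and audio file are placeholder assumptions:

        # "Record everything" sketch: local transcription with openai-whisper,
        # then analysis by a local LLM behind an OpenAI-compatible endpoint.
        # Endpoint URL, model name, and audio file are placeholder assumptions.
        import whisper
        from openai import OpenAI

        stt = whisper.load_model("base")          # small Whisper models need modest compute
        transcript = stt.transcribe("meeting.wav")["text"]

        llm = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
        resp = llm.chat.completions.create(
            model="local-model",
            messages=[
                {"role": "system", "content": "You summarize personal audio transcripts."},
                {"role": "user",
                 "content": f"List key decisions and follow-ups from this transcript:\n\n{transcript}"},
            ],
        )
        print(resp.choices[0].message.content)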

    • nottorp 2 days ago

      > it just feels too weird to send literally everything I'm saying all of the time to the cloud

      ... or your clients' codebases ...

  • andy99 2 days ago

    Autonomy generally, not just privacy. You never know what the future will bring, AI will be enshittified and so will hubs like huggingface. It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

    • Aurornis 2 days ago

      > You never know what the future will bring, AI will be enshittified and so will hubs like huggingface.

      If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

      > It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

      You can pay cloud providers for access to the same models that you can run locally, though. You don’t need a local setup even for this unlikely future scenario where all of the mainstream LLM providers simultaneously decide to make their LLMs poor quality and none of them sees it as a market opportunity to provide good service.

      But even if we ignore all of that and assume that all of the cloud inference everywhere becomes bad at the same time at some point in the future, you would still be better off buying your own inference hardware at that point in time. Spending the money to buy two M3 Ultras right now to prepare for an unlikely future event is illogical.

      The only reason to run local LLMs is if you have privacy requirements or you want to do it as a hobby.

      • CamperBob2 2 days ago

        > If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

        OK. How do we set up this wager?

        I'm not knowledgeable about online gambling or prediction markets, but further enshittification seems like the world's safest bet.

    • chrsw 2 days ago

      Yes, I agree. And you can add security to that too.

alwillis 2 days ago

Hopefully the next time it’s updated, it will ship with some variant of the M5.

amelius 2 days ago

Maybe wait until RAM prices have normalized again.

segmondy 2 days ago

This is a weird line of thinking. Here's a question: if you buy one of these and figure out how to use it to make $100k in 3 months, would that be good? When you run a local model, you shouldn't compare it to the cost of using an API. The value lies in how you use it.

Let's forget about making money. Let's just say you have a weird fetish and like to have dirty sexy conversations with your LLM. How much would you pay for your data not to be leaked and for the world not to see your chat? Perhaps having your own private LLM makes it all worth it.

If you have nothing special going on, then by all means use APIs, but if you feel/know your input is special, then yeah, go private.