Comment by TIPSIO 2 days ago

56 replies

It's awesome that stuff like this is open source, but even if you have a basement rig with 4 NVIDIA GeForce RTX 5090 graphics cards (a $15-20k machine), can it even run with any reasonable context window that isn't crawling along at like 10 tps?

Frontier models are far exceeding even the most hardcore consumer hobbyist setups, and the gap is only getting wider.

tarruda 2 days ago

You can run it at ~20 tokens/second on a 512GB Mac Studio M3 Ultra: https://youtu.be/ufXZI6aqOU8?si=YGowQ3cSzHDpgv4z&t=197

IIRC the 512GB Mac Studio is about $10k.

  • hasperdi 2 days ago

    and it can be faster if you can get an MoE version of that model

    • dormento 2 days ago

      "Mixture-of-experts", AKA "running several small models and activating only a few at a time". Thanks for introducing me to that concept. Fascinating.

      (commentary: things are really moving too fast for the layperson to keep up)

      • hasperdi 2 days ago

        As pointed out by a sibling comment, MoE consists of a router and a number of experts (e.g. 8). These experts can be imagined as parts of the brain with specializations, although in reality they probably don't work exactly like that. They aren't separate models; they are components of a single large model.

        Typically, input gets routed to a small number of experts, e.g. the top 2, leaving the others inactive. This reduces the activation/processing requirements.

        Mixtral (from Mistral) is an example of a model designed like this. Clever people have created converters to transform dense models into MoE models. These days many popular models are also available in an MoE configuration.
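
        A minimal sketch of that top-k routing idea (illustrative only; real MoE layers in models like Mixtral or DeepSeek are more involved):

          # Toy MoE layer: a router scores experts, only the top-k run for a given token.
          import numpy as np

          def moe_layer(x, router_w, experts, top_k=2):
              logits = router_w @ x                      # one score per expert
              top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
              w = np.exp(logits[top] - logits[top].max())
              w /= w.sum()                               # softmax over the chosen experts only
              # Only the selected experts are evaluated; the rest stay inactive.
              return sum(wi * experts[i](x) for wi, i in zip(w, top))

          rng = np.random.default_rng(0)
          d, n_experts = 16, 8
          experts = [lambda v, W=rng.normal(size=(d, d)): W @ v for _ in range(n_experts)]
          router_w = rng.normal(size=(n_experts, d))
          print(moe_layer(rng.normal(size=d), router_w, experts).shape)   # (16,)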

      • whimsicalism 2 days ago

        That's not really a good summary of what MoEs are. You can think of them more as sublayers that get routed through (like how the brain only lights up certain pathways) rather than as actual separate models.

    • miohtama 2 days ago

      All modern models are MoE already, no?

      • hasperdi a day ago

        That's not the case. Some are dense and some are hybrid.

        MoE is not the holy grail either; there are drawbacks, e.g. less consistency and expert under-/over-utilization.

    • bigyabai 2 days ago

      >90% of inference hardware is faster if you run an MOE model.

noosphr 2 days ago

Home rigs like that are no longer cost effective. You're better off buying an RTX Pro 6000 outright. This holds for the sticker price, the supporting hardware, the electricity to run it, and the cost of cooling the room you use it in.

  • torginus 2 days ago

    I was just watching this video about a Chinese piece of industrial equipment, designed for replacing BGA chips such as flash or RAM with a good deal of precision:

    https://www.youtube.com/watch?v=zwHqO1mnMsA

    I wonder how well the aftermarket memory surgery business on consumer GPUs is doing.

    • dotancohen 2 days ago

      I wonder how well the ophthalmologist is doing. These guys are going to be paying him a visit, playing around with those lasers with no PPE.

      • CamperBob2 2 days ago

        Eh, I don't see the risk, no pun intended. It's not collimated, and it's not going to be in focus anywhere but on-target. It's also probably in the long-wave range >>1000 nm that's not focused by the eye. At the end of the day it's no different from any other source of spot heating. I get more nervous around some of the LED flashlights you can buy these days.

        I want one. Hot air blows.

  • throw4039 2 days ago

    Yeah, the pricing for the RTX Pro 6000 is surprisingly competitive with the gamer cards (at actual prices, not MSRP). A 3x5090 rig will require significant tuning/downclocking to run from a single North American 15A plug, and the cost of the higher-powered supporting equipment (cooling, PSU, UPS, etc.) will eat up the price difference, not to mention future expansion possibilities.
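
    Back-of-the-envelope power math behind the single-plug point (the ~575 W stock board power per 5090 is an assumption; adjust for your cards):

      outlet_watts = 120 * 15 * 0.8      # 15 A circuit at 120 V, derated 80% for continuous load ≈ 1440 W
      gpu_watts = 3 * 575                # three RTX 5090s at stock power ≈ 1725 W
      print(gpu_watts > outlet_watts)    # True: over budget before CPU, fans, and PSU losses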

  • mikae1 2 days ago

    Or perhaps a 512GB Mac Studio. 671B Q4 of R1 runs on it.
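
    Rough weights-only sizing for why that fits (assuming ~4 bits per parameter and ignoring quantization overhead):

      params = 671e9                  # R1 parameter count
      gb = params * 0.5 / 1e9         # ~4-bit quantization -> 0.5 bytes per parameter
      print(gb)                       # ≈ 335 GB, leaving headroom for KV cache within 512 GB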

    • redrove 2 days ago

      I wouldn’t say runs. More of a gentle stroll.

      • storus 2 days ago

        I run it all the time; token generation is pretty good. Only large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and outsource token prefill to it. The upcoming M5 Ultra should be faster than the Spark at token prefill as well.

      • hasperdi 2 days ago

        With quantization, converting it to an MOE model... it can be a fast walk

reilly3000 2 days ago

There are plenty of 3rd party and big cloud options to run these models by the hour or token. Big models really only work in that context, and that’s ok. Or you can get yourself an H100 rack and go nuts, but there is little downside to using a cloud provider on a per-token basis.

  • cubefox 2 days ago

    > There are plenty of 3rd party and big cloud options to run these models by the hour or token.

    Which ones? I wanted to try a large base model for automated literature (fine-tuned models are a lot worse at it) but I couldn't find a provider which makes this easy.

    • reilly3000 2 days ago

      If you’re already using GCP, Vertex AI is pretty good. You can run lots of models on it:

      https://docs.cloud.google.com/vertex-ai/generative-ai/docs/m...

      Lambda.ai used to offer per-token pricing but they have moved upmarket. You can still rent a B200 instance for under $5/hr, which is reasonable for experimenting with models.

      https://app.hyperbolic.ai/models Hyperbolic offers both GPU hosting and per-token pricing for popular OSS models. The token-based options are easy because they're usually a drop-in replacement for the OpenAI API endpoints (sketched below).

      You have to rent a GPU instance if you want to run the latest or custom stuff, but if you just want to play around for a few hours it's not unreasonable.
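
      The drop-in pattern looks roughly like this with the OpenAI Python client; the base_url and model id below are illustrative assumptions, so check the provider's docs for the real values:

        # Point the standard OpenAI client at an OpenAI-compatible provider endpoint.
        from openai import OpenAI

        client = OpenAI(
            base_url="https://api.hyperbolic.xyz/v1",  # provider's endpoint (assumed)
            api_key="YOUR_PROVIDER_KEY",
        )
        resp = client.chat.completions.create(
            model="deepseek-ai/DeepSeek-V3.2",         # model id as the provider exposes it (assumed)
            messages=[{"role": "user", "content": "Hello"}],
        )
        print(resp.choices[0].message.content)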

      • verdverm 2 days ago

        GCloud and Hyperbolic have been my go-to as well

      • cubefox a day ago

        > If you’re already using GCP, Vertex AI is pretty good. You can run lots of models on it:

        > https://docs.cloud.google.com/vertex-ai/generative-ai/docs/m...

        I don't see any large base models there. A base model is a pretrained foundation model without fine tuning. It just predicts text.

        > Lambda.ai used to offer per-token pricing but they have moved up market. You can still rent a B200 instance for sub $5/hr which is reasonable for experimenting with models.

        A B200 is probably not enough: it has just 192 GB of RAM, while DeepSeek-V3.2-Exp-Base, the base model for DeepSeek-V3.2, has 685 billion BF16 parameters. Though I assume they have larger options. The problem is that all the configuration work is then left to the user, and I'm not experienced with that.
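
        The weights-only arithmetic makes the gap concrete (KV cache and activations come on top):

          params = 685e9                  # DeepSeek-V3.2-Exp-Base parameter count
          gb = params * 2 / 1e9           # BF16 -> 2 bytes per parameter
          print(gb)                       # ≈ 1370 GB vs. 192 GB on a single B200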

        > https://app.hyperbolic.ai/models Hyperbolic offers both GPU hosting and token pricing for popular OSS models

        Thanks. They do indeed have a single base model: Llama 3.1 405B BASE. It's a bit older (July 2024) and probably not as good as the base model for the new DeepSeek release, but that might be the best one can do, as there don't seem to be any inference providers that have deployed a DeepSeek or even Kimi base model.

    • weberer a day ago

      Fireworks serves this model serverless for $1.20 per million tokens.

      https://fireworks.ai/models/fireworks/deepseek-v3p2

    • big_man_ting 2 days ago

      Have you checked whether OpenRouter has any providers who serve the model you need?

      • cubefox a day ago

        I searched for "base" and the best available base model does indeed seem to be Llama 3.1 405B Base at Hyperbolic.ai, as mentioned in the comment above.

halyconWays 2 days ago

As someone with a basement rig of 6x 3090s, not really. It's quite slow, since with that many params (685B) it's offloading basically all of them into system RAM. I limit myself to models with <144B params; then it's quite an enjoyable experience. GLM 4.5 Air has been great in particular.

  • lostmsu a day ago

    Did you find it better than GPT-OSS 120B? The public rankings are contradictory.

seanw265 2 days ago

FWIW, it looks like OpenRouter's two providers for this model (one of which is DeepSeek itself) are only running it at around 28 tps at the moment.

https://openrouter.ai/deepseek/deepseek-v3.2

This only bolsters your point. It will be interesting to see if this changes as the model is adopted more widely.

bigyabai 2 days ago

People with basement rigs generally aren't the target audience for these gigantic models. You'd get much better results out of an MoE model like Qwen3's A3B/A22B weights, if you're running a homelab setup.

  • Aachen a day ago

    Who is the target audience of these free releases? I don't mind free and open information sharing but I have wondered what's in it for the people that spent unholy amounts of energy on scraping, developing, and training

  • Spivak 2 days ago

    Yeah I think the advantage of OSS models is that you can get your pick of providers and aren't locked into just Anthropic or just OpenAI.

    • hnfong 2 days ago

      Reproducibility of results is also important in some cases.

      There is consumer-ish hardware that can run large models like DeepSeek 3.x, albeit slowly. If you're using LLMs for a specific purpose that is well served by a particular model, you don't want to risk AI companies deprecating it in a couple of months and pushing you to a newer model (that may or may not work better in your situation).

      And even if the AI service providers nominally use the same model, you might have cases where you need the same inference software, or even the same hardware, to maintain high reproducibility of the results.

      If you're just using OpenAI or Anthropic you just don't get that level of control.

potsandpans 2 days ago

I run a bunch of smaller models on a 12GB VRAM 3060 and it's quite good. For larger open models I'll use OpenRouter. I'm looking into on-demand instances with cloud/VPS providers, but haven't explored the space too much.

I feel like private cloud instances that run on demand are still in the spirit of consumer hobbyism. It's not as good as having it all local, but the bootstrapping cost plus the electricity to run a fully local rig seems prohibitive.

I'm really interested to see if there's a space for consumer TPUs that satisfy usecases like this.