Comment by rc1

Comment by rc1 2 days ago

How long until this can be run on consumer grade hardware or a domestic electricity supply I wonder.

Anyone have a projection?

You can run it on consumer grade hardware right now, but it will be rather slow. NVMe SSDs these days have a read speed of 7 GB/s (EDIT: or even faster than that! Thank you @hedgehog for the update), so it will give you one token roughly every three seconds while crunching through the 32 billion active parameters, which are natively quantized to 4 bit each. If you want to run it faster, you have to spend more money.

Some people in the localllama subreddit have built systems which run large models at more decent speeds: https://www.reddit.com/r/LocalLLaMA/

Reply View 3 replies

hedgehog 2 days ago

High end consumer SSDs can do closer to 15 GB/s, though only with PCI-e gen 5. On a motherboard with two m.2 slots that's potentially around 30GB/s from disk. Edit: How fast everything is depends on how much data needs to get loaded from disk which is not always everything on MoE models.

Reply View | 2 replies
- greenavocado a day ago
  
  Would RAID zero help here?
  
  Reply View | 1 reply
  
  hedgehog a day ago
  
  Yes, RAID 0 or 1 could both work in this case to combine the disks. You would want to check the bus topology for the specific motherboard to make sure the slots aren't on the other side of a hub or something like that.
  
  Reply View | 0 replies

segmondy 2 days ago

You can run it on a mac studio with 512gb ram, that's the easiest way. I run it at home on a multi rig GPU with partial offload to ram.

Reply View 1 reply

johndough 2 days ago

I was wondering whether multiple GPUs make it go appreciably faster when limited by VRAM. Do you have some tokens/sec numbers for text generation?

Reply View | 0 replies

heliumtera 2 days ago

You need 600gb of VRAM + MEMORY (+ DISK) to fit the model (full) or 240 for the 1b quantized model. Of course this will be slow.

Through moonshot api it is pretty fast (much much much faster than Gemini 3 pro and Claude sonnet, probably faster than Gemini flash), though. To get similar experience they say at least 4xH200.

If you don't mind running it super slow, you still need around 600gb of VRAM + fast RAM.

It's already possible to run 4xH200 in a domestic environment (it would be instantaneous for most tasks, unbelievable speed). It's just very very expensive and probably challenging for most users, manageable/easy for the average hacker news crowd.

Expensive AND hard to source high end GPUs, if you manage to source for the old prices around 200 thousand dollars to get maximum speed I guess, you could probably run decently on a bunch of high end machines, for let's say, 40k (slow).

Reply View 0 replies