redrove 2 days ago

I wouldn’t say runs. More of a gentle stroll.

  • storus 2 days ago

    I run it all the time; token generation is pretty good. Only large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and offload token prefill to it. The upcoming M5 Ultra should be faster than the Spark at token prefill as well.
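
    Roughly, the split looks like this (a minimal sketch; the names here are made up for illustration, not Exo's actual API):

        # Hypothetical prefill/decode split. The point is the shape of the
        # handoff: the Spark does the compute-bound prefill and ships back a
        # KV cache; the Mac runs the bandwidth-bound decode loop against it.

        from dataclasses import dataclass

        @dataclass
        class KVCache:
            tokens_seen: int  # stand-in for the per-layer key/value tensors

        def prefill_on_spark(prompt_tokens: list[int]) -> KVCache:
            # Would be a network call to the Spark in a real setup.
            return KVCache(tokens_seen=len(prompt_tokens))

        def decode_locally(cache: KVCache, max_new_tokens: int) -> list[int]:
            out = []
            for _ in range(max_new_tokens):
                out.append(0)           # placeholder for the sampled token id
                cache.tokens_seen += 1  # cache grows as generation proceeds
            return out

        cache = prefill_on_spark(list(range(4096)))  # long prompt: slow locally
        completion = decode_locally(cache, 256)      # local decode loop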

    • embedding-shape 2 days ago

      > I run it all the time, token generation is pretty good.

      Because you didn't actually give numbers for prompt processing or generation speed, you aren't really giving the whole picture here. What are the prompt processing tok/s and the generation tok/s actually like?

      • storus 2 days ago

        I addressed both points: I mentioned you can offload token prefill (the slow part, 9 t/s locally) to the DGX Spark. Token generation is at 6 t/s, which is acceptable.
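
        For scale (the prompt and output lengths below are made-up examples, only the two rates come from this thread):

            # Back-of-the-envelope latency from the numbers above.
            prefill_tps = 9   # local prompt processing, tokens/s
            decode_tps = 6    # local generation, tokens/s

            prompt_tokens = 4000
            output_tokens = 300

            prefill_s = prompt_tokens / prefill_tps  # ~444 s: dominates
            decode_s = output_tokens / decode_tps    # 50 s

            print(f"prefill: {prefill_s:.0f}s, decode: {decode_s:.0f}s")
            # prefill: 444s, decode: 50s -- which is why offloading
            # prefill to faster hardware is the part worth fixing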

  • hasperdi 2 days ago

    With quantization, converting it to an MoE model... it can be a fast walk.
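
    Back-of-the-envelope, local decoding is roughly memory-bandwidth-bound, so shrinking the weights raises the speed ceiling (all numbers below are illustrative assumptions, not measurements from this thread):

        # Bandwidth-bound estimate of why quantization speeds up decoding.
        # Assumed: a 70B-parameter dense model and ~800 GB/s of memory
        # bandwidth (roughly M-series Ultra territory).
        params = 70e9
        bandwidth = 800e9  # bytes/s

        for name, bytes_per_param in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
            bytes_per_token = params * bytes_per_param  # weights read per token
            print(f"{name}: ~{bandwidth / bytes_per_token:.1f} t/s upper bound")
        # fp16: ~5.7 t/s, q8: ~11.4 t/s, q4: ~22.9 t/s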