Comment by redrove
I wouldn’t say runs. More of a gentle stroll.
> I run it all the time, token generation is pretty good.
I feel like because you didn't actually give prompt processing speed or token/s numbers, you aren't really giving the whole picture here. What are the prompt processing tok/s and the generation tok/s actually like?
6 tok/sec might be acceptable for a dense model that doesn't do thinking, but for something like DeepSeek 3.2 that does reason, 6 tok/sec isn't acceptable for anything but async/batched stuff, sadly. Even a response of just 100 tokens takes ~17 seconds to write, and for anything except the smallest of prompts you'll easily be hitting 1000+ tokens of thinking plus response (nearly 3 minutes!).
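Quick back-of-the-envelope, just dividing token counts by the decode speeds being discussed (the numbers here are the 6 and 9 tok/s from this thread, nothing measured by me):

```python
# Rough latency estimate: seconds to generate N tokens at a steady decode rate.
# Token counts and tok/s figures are the illustrative ones from the thread above.

def response_latency_s(total_tokens: int, decode_tok_per_s: float) -> float:
    """Seconds to emit `total_tokens` at a constant decode throughput."""
    return total_tokens / decode_tok_per_s

for tokens in (100, 1000):
    for speed in (6.0, 9.0):
        print(f"{tokens:>5} tokens @ {speed:.0f} tok/s -> "
              f"{response_latency_s(tokens, speed):6.1f} s")
# 100 tokens @ 6 tok/s  ->  ~16.7 s
# 1000 tokens @ 6 tok/s -> ~166.7 s (nearly 3 minutes)
```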
Maybe my 6000 Pro spoiled me, but for actual usage, 6 or even 9 tok/sec is too slow for a reasoning/thinking model. To be honest, it's kind of expected on CPU though. I guess it's cool that it can run on Apple hardware, but it isn't exactly a pleasant experience, at least today.
I run it all the time; token generation is pretty good. Only large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and offload token prefill to it. The upcoming M5 Ultra should be faster than the Spark at prefill as well.
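To show why offloading prefill helps, here's a toy latency model: end-to-end time is prompt prefill plus token decode, and a long prompt makes prefill dominate. The throughput numbers are hypothetical placeholders for illustration, not measured Spark or M-series figures:

```python
# Toy model: total latency = prompt prefill time + output decode time.
# Disaggregating the stages lets a prefill-fast box (e.g. a DGX Spark)
# chew through the prompt while the Mac only has to decode.
# All throughput numbers below are hypothetical, for illustration only.

def latency_s(prompt_tokens: int, output_tokens: int,
              prefill_tok_s: float, decode_tok_s: float) -> float:
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s

prompt, output = 8000, 1000

# Everything local: slow prefill dominates on long prompts.
local = latency_s(prompt, output, prefill_tok_s=60, decode_tok_s=6)

# Prefill offloaded to a faster box; decode stays local at the same 6 tok/s.
offloaded = latency_s(prompt, output, prefill_tok_s=2000, decode_tok_s=6)

print(f"all local:         {local:6.1f} s")      # ~300 s
print(f"prefill offloaded: {offloaded:6.1f} s")  # ~171 s
```

Decode time is unchanged in both cases; the win is entirely in the prefill term, which is why this only matters for large contexts.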