storus 2 days ago

I run it all the time; token generation is pretty good. It's just that large contexts are slow, but you can hook up a DGX Spark via the Exo Labs stack and offload token prefill to it. The upcoming M5 Ultra should be faster than the Spark at token prefill as well.
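
For reference, a minimal sketch of what talking to such a cluster looks like, assuming exo is already running on both machines on the same LAN and exposing its ChatGPT-compatible API on its default port (52415 at the time of writing); the model id and prompt are placeholders, and the prefill/decode split happens inside the cluster, not in this client code:

    # Minimal client against an exo cluster (Mac + DGX Spark).
    # Assumes exo's OpenAI-style endpoint on its default port 52415;
    # the model id below is illustrative only.
    import requests

    resp = requests.post(
        "http://localhost:52415/v1/chat/completions",
        json={
            "model": "deepseek-v3",  # placeholder model id
            "messages": [{"role": "user", "content": "Hello!"}],
        },
        timeout=600,  # generous timeout for slow local decoding
    )
    print(resp.json()["choices"][0]["message"]["content"])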

  • embedding-shape 2 days ago

    > I run it all the time, token generation is pretty good.

    I feel like you aren't really giving the whole picture here, because you didn't actually give any numbers for prompt processing speed or tok/s. What are the prompt processing tok/s and the generation tok/s actually like?

    • storus 2 days ago

      I addressed both points - I mentioned you can offload token prefill (the slow part, 9 t/s) to a DGX Spark. Token generation is at 6 t/s, which is acceptable.

      • embedding-shape a day ago

        6 tok/sec might be acceptable for a dense model that doesn't do thinking, but for something like DeepSeek 3.2 that does do reasoning, 6 tok/sec isn't acceptable for anything but async/batched stuff, sadly. Even a response of just 100 tokens takes about 17 seconds of pure generation, and for anything except the smallest of prompts you'll easily be hitting 1000+ tokens (nearly three minutes just to write the response!).

        Maybe my 6000 Pro spoiled me, but for actual usage, 6 or even 9 tok/sec is too slow for a reasoning/thinking model. To be honest, that's kind of expected on CPU though. I guess it's cool that it can run on Apple hardware, but it isn't exactly a pleasant experience, at least today.
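
        To make the math concrete, here's a quick back-of-the-envelope calculation, taking the 9 tok/s prefill and 6 tok/s decode figures from above at face value (real numbers will vary with context length and hardware):

            # Rough end-to-end latency estimate; the rates are the figures
            # quoted upthread, the token counts are illustrative.
            PREFILL_RATE = 9.0  # tokens/sec (prompt processing)
            DECODE_RATE = 6.0   # tokens/sec (generation)

            def latency_s(prompt_tokens: int, output_tokens: int) -> float:
                # total = time to process the prompt + time to generate output
                return prompt_tokens / PREFILL_RATE + output_tokens / DECODE_RATE

            print(latency_s(500, 100))   # ~72 s for a short answer
            print(latency_s(500, 1000))  # ~222 s once thinking tokens pile up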

      • redrove a day ago

        6 t/s will have you pulling your hair out with any DeepSeek model.

hasperdi 2 days ago

With quantization, converting it to an MoE model... it can be a fast walk
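
For the quantization half of that, a minimal sketch with mlx_lm on Apple Silicon (assuming pip install mlx-lm; the Hugging Face model id is a stand-in, since the full model needs far more memory):

    # 4-bit quantization via mlx_lm's convert utility.
    # The model id below is illustrative only.
    from mlx_lm import convert

    convert(
        hf_path="deepseek-ai/DeepSeek-V2-Lite",  # placeholder model id
        mlx_path="deepseek-4bit",                # local output directory
        quantize=True,
        q_bits=4,         # 4-bit weights
        q_group_size=64,  # quantization group size
    )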