Comment by embedding-shape 2 days ago

> I run it all the time, token generation is pretty good.

I feel like you aren't really giving the whole picture here, because you didn't actually talk about prompt processing speed or token/s. What are the prompt processing tok/s and the generation tok/s actually like?

storus 2 days ago

I addressed both points: I mentioned you can offload prompt prefill (the slow part, 9 t/s) to the DGX Spark, and token generation is at 6 t/s, which is acceptable.
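
Back-of-the-envelope on what those two rates mean end to end (the 9 t/s and 6 t/s figures are the ones quoted above; the prompt/output sizes are made up for illustration):

```python
# Rough latency model for the prefill/decode split described above.
# Rates come from this thread; token counts are hypothetical.
PREFILL_TPS = 9.0  # prompt prefill offloaded to the DGX Spark (tokens/s)
DECODE_TPS = 6.0   # token generation on the Mac Studio (tokens/s)

def total_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds until the last output token is written."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

for prompt, out in [(500, 100), (2000, 500)]:
    print(f"{prompt}-token prompt, {out}-token reply: {total_latency_s(prompt, out):.0f}s")
```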

  • embedding-shape a day ago

    6 tok/sec might be acceptable for a dense model that doesn't do thinking, but for something like DeepSeek 3.2, which does reason, 6 tok/sec isn't acceptable for anything but async/batched stuff, sadly. Even a response of just 100 tokens takes ~17 seconds to write, and for anything except the smallest prompts you'll easily be hitting 1000+ tokens (nearly three minutes; see the sketch below).

    Maybe my 6000 Pro spoiled me, but for actual usage, 6 or even 9 tok/sec is too slow for a reasoning/thinking model. To be honest, that's kind of expected on CPU, though. I guess it's cool that it can run on Apple hardware, but it isn't exactly a pleasant experience, at least today.
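
    A quick sketch of why that hurts for thinking models (the 6 t/s rate is from this thread; the reasoning/answer token counts are illustrative guesses):

    ```python
    # Wall-clock time for a reply at 6 tokens/s, counting the reasoning
    # tokens a thinking model emits before the visible answer.
    DECODE_TPS = 6.0  # generation rate quoted in this thread

    def wait_s(reasoning_tokens: int, answer_tokens: int) -> float:
        return (reasoning_tokens + answer_tokens) / DECODE_TPS

    print(f"short reply:   {wait_s(500, 100):.0f}s")   # 100s for a 100-token answer
    print(f"typical reply: {wait_s(2000, 500):.0f}s")  # ~7 minutes end to end
    ```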

    • storus a day ago

      Dunno, DeepSeek on a Mac Studio doesn't feel much slower than using it directly on deepseek.com; 6 t/s is still around 24 characters per second, which is faster than many people can read. I also have a 6000 Pro, but you won't fit any large model on it; to run DeepSeek R1/3.1/3.2 671B at Q4 you'd need 5-6 of them, depending on the communication overhead (napkin math below). Mac Studio is the simplest solution to run it locally.
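
      Napkin math behind both claims (the ~4 chars/token ratio and the 25% overhead factor are assumptions, not measurements):

      ```python
      # Reading speed: English text averages roughly 4 characters per token.
      print(6 * 4, "chars/s at 6 t/s")  # ~24 chars/s

      # VRAM: DeepSeek 671B at Q4 across 96 GB RTX 6000 Pro cards.
      weights_gb = 671 * 0.5        # ~4 bits per weight -> ~336 GB of weights
      total_gb = weights_gb * 1.25  # assumed KV cache + runtime overhead
      print(f"~{total_gb:.0f} GB total -> {total_gb / 96:.1f} cards, i.e. 5-6 in practice")
      ```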

      • embedding-shape 19 hours ago

        > 6 t/s is still around 24 characters per second, which is faster than many people can read.

        But again, not if you're using thinking/reasoning, which, if you want to use this specific model properly, you are. Then you have a huge delay before the actual response comes through.

        > Mac Studio is the simplest solution to run it locally.

        Obviously, that's Apple's core value proposition after all :) One does not acquire a state-of-the-art GPU and then expect things to be simple, especially when it's a fairly uncommon and new one. You can't really be afraid of diving into CUDA code and similar fun rabbit holes. They're simply two very different audiences for the two alternatives, and the Apple way is the simpler one, no doubt about it.

  • redrove a day ago

    6 t/s will have you pulling your hair out with any DeepSeek model.