Comment by embedding-shape

6 tok/sec might be acceptable for a dense model that doesn't do thinking, but for something like DeepSeek 3.2 that does do reasoning, 6 tok/sec isn't acceptable for anything but async/batched stuff, sadly. Even a response of just 100 tokens takes about 17 seconds to write, and for anything except the smallest of prompts you'll easily be hitting 1000 tokens (roughly 167 seconds, close to three minutes).
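For a rough sanity check on wait times at these speeds, here's a minimal sketch. It assumes a constant decode rate and ignores prompt processing time; `generation_seconds` is just an illustrative helper name, not anything from a real library.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit num_tokens at a constant decode rate.

    Assumption: prompt processing (prefill) time is ignored.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Rates and token counts taken from the discussion above.
    for rate in (6, 9):
        for tokens in (100, 1000):
            seconds = generation_seconds(tokens, rate)
            print(f"{tokens} tokens at {rate} tok/sec: about {seconds:.0f} s")
```

With a reasoning model, the thinking tokens count toward `num_tokens` too, so the wait before the visible answer grows the same way.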

Maybe my 6000 Pro spoiled me, but for actual usage, 6 or even 9 tok/sec is too slow for a reasoning/thinking model. To be honest, kind of expected on CPU though. I guess it's cool that it can run on Apple hardware, but it isn't exactly a pleasant experience, at least today.

storus a day ago

Dunno, DeepSeek on MacStudio doesn't feel much slower than when using it directly on deepseek.com; 6t/s is still around 24 characters per second which is faster than many people could read. I also have a 6000 Pro, but no large model fits on it; to run DeepSeek R1/3.1/3.2 671B at Q4 you'd need 5-6 of them, depending on the communication overhead. MacStudio is the simplest solution to run it locally.
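Back-of-envelope version of both estimates, as a sketch only. Every constant here is an assumption on my part rather than a measured value: about 4 characters per token, 96 GB of VRAM per card, about 4.5 bits per weight for a Q4-style quant (weights plus scales), and 20% extra for KV cache and buffers.

```python
import math

CHARS_PER_TOKEN = 4       # assumption: rough average for English text
GPU_MEMORY_GB = 96        # assumption: VRAM of a single 6000 Pro card
BITS_PER_WEIGHT = 4.5     # assumption: Q4-style quant including scale data
OVERHEAD_FRACTION = 0.2   # assumption: KV cache, activations, buffers


def chars_per_second(tokens_per_second: float) -> float:
    """Approximate reading-speed equivalent of a decode rate."""
    return tokens_per_second * CHARS_PER_TOKEN


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight size in GB: billions of params times bytes per param."""
    return params_billion * bits_per_weight / 8


def gpus_needed(params_billion: float) -> int:
    """Cards required to hold the weights plus the assumed overhead."""
    total_gb = weights_gb(params_billion, BITS_PER_WEIGHT) * (1 + OVERHEAD_FRACTION)
    return math.ceil(total_gb / GPU_MEMORY_GB)


if __name__ == "__main__":
    print(f"6 tok/sec is roughly {chars_per_second(6):.0f} chars/sec")
    print(f"671B weights: about {weights_gb(671, BITS_PER_WEIGHT):.0f} GB")
    print(f"Cards needed: {gpus_needed(671)}")
```

Under those assumptions the count lands at the low end of the 5-6 range; a bigger context window or more inter-GPU overhead pushes it up.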


embedding-shape 21 hours ago

> 6t/s is still around 24 characters per second which is faster than many people could read.
But again, not if you're using thinking/reasoning, which you are if you want to use this specific model properly. Then you have a huge delay before the actual response comes through.
> MacStudio is the simplest solution to run it locally.
Obviously, that's Apple's core value proposition after all :) One does not acquire a state-of-the-art GPU and then expect simple stuff, especially when it's a fairly uncommon and new one. You cannot really be afraid of diving into CUDA code and similar fun rabbit holes. Simply two very different audiences for the two alternatives, and the Apple way is the simpler one, no doubt about it.
