Comment by embedding-shape a day ago
6 tok/sec might be acceptable for a dense model that doesn't do thinking, but for something like DeepSeek 3.2 that does do reasoning, 6 tok/sec isn't acceptable for anything but async/batched stuff, sadly. Even a response of just 100 tokens takes about 17 seconds to write, and for anything except the smallest of prompts you'll easily be hitting 1000 tokens (about 167 seconds, nearly three minutes!).
Maybe my 6000 Pro spoiled me, but for actual usage, 6 or even 9 tok/sec is too slow for a reasoning/thinking model. To be honest, that's kind of expected on CPU though. I guess it's cool that it can run on Apple hardware, but it isn't exactly a pleasant experience, at least today.
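For anyone who wants to plug in their own numbers, here is a rough sketch of the wait-time arithmetic at a fixed decode rate. It assumes generation time is simply tokens divided by tok/sec and ignores prompt processing entirely; `generation_seconds` is a made-up helper name, not part of any tool.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit num_tokens at a constant decode rate.

    Assumption: prompt processing (prefill) time is ignored, so real
    waits will be longer than this.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # The two decode rates and two response lengths mentioned above.
    for rate in (6, 9):
        for tokens in (100, 1000):
            secs = generation_seconds(tokens, rate)
            print(f"{tokens:>5} tokens at {rate} tok/sec: {secs:6.1f} s")
```

A reasoning model spends those tokens on thinking before the visible answer starts, so this is roughly the delay before you see anything useful.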
Dunno, DeepSeek on a Mac Studio doesn't feel much slower than using it directly on deepseek.com; 6 t/s is still around 24 characters per second (at roughly 4 characters per token), which is faster than many people can read. I also have a 6000 Pro, but you won't fit any large model on it, and to run DeepSeek R1/3.1/3.2 671B at Q4 you'd need 5-6 of them, depending on the communication overhead. A Mac Studio is the simplest way to run it locally.
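A back-of-the-envelope sketch of where numbers like "24 characters per second" and "5-6 cards" come from. Everything here is an assumption for illustration: about 4 characters per token, about 4.5 effective bits per weight for a Q4-style quant, 96 GB per card, and a guessed fraction of each card left for weights after KV cache and other overhead. The function names are made up.

```python
import math


def chars_per_second(tokens_per_second: float, chars_per_token: float = 4.0) -> float:
    """Approximate reading-speed equivalent (assumes ~4 chars per token)."""
    return tokens_per_second * chars_per_token


def weights_gb(num_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def gpus_needed(num_params: float, bits_per_weight: float,
                gpu_gb: float, usable_fraction: float) -> int:
    """Cards needed if only usable_fraction of each card holds weights.

    usable_fraction is a guess standing in for KV cache, activations
    and multi-GPU overhead; it is not a measured value.
    """
    per_gpu = gpu_gb * usable_fraction
    return math.ceil(weights_gb(num_params, bits_per_weight) / per_gpu)


if __name__ == "__main__":
    params = 671e9  # DeepSeek R1/3.1/3.2 parameter count from the comment
    print(f"{chars_per_second(6):.0f} chars/sec at 6 tok/sec")
    print(f"weights: {weights_gb(params, 4.5):.0f} GB at ~4.5 bits/weight")
    for frac in (0.8, 0.7):
        n = gpus_needed(params, 4.5, gpu_gb=96, usable_fraction=frac)
        print(f"usable fraction {frac}: {n} cards of 96 GB")
```

Under these assumptions the weights alone come to several hundred GB, so the card count is very sensitive to how much headroom you reserve per card.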