Quantized, heavily, and with everything possible offloaded to system RAM. You can run it this way, just barely, on consumer hardware with 16 to 24 GB of VRAM and 256 GB of system RAM. Before the spike in prices you could just about build such a system for $2,500; now the RAM alone probably adds another $2k to that. Nvidia DGX boxes and similar setups with 256 GB of unified RAM can probably manage it, more slowly, at ~1-2 tokens per second. Unsloth has the quantized models. I've tested Kimi, though I don't quite have the headroom at home for it, and I don't yet see a significant enough difference between it and the Qwen 3 models that run on more modest setups. I get a highly usable 50 tokens per second out of the A3B instruct, which fits into 16 GB of VRAM with enough left over not to choke Netflix and other browser tasks. It performs on par with what I ask of Haiku in Claude Code, and better as my own tweaking improves along with the ever-better tooling that comes out nearly weekly.
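For anyone wanting to try this, a minimal sketch of the setup using llama.cpp with a GGUF quant (the usual route for running Unsloth's quantized models): the model filename, layer count, and thread count below are illustrative assumptions, not a tested config.

```shell
# Sketch: run a heavily quantized GGUF with llama.cpp, offloading what
# fits into VRAM and leaving the rest in system RAM.
# The filename and numbers are placeholders; tune them for your hardware.

# -ngl: transformer layers offloaded to the GPU (lower if you hit OOM)
# -c:   context size in tokens
# -t:   CPU threads serving the layers left in system RAM
./llama-cli \
  -m Qwen3-30B-A3B-Instruct-Q4_K_M.gguf \
  -ngl 30 -c 8192 -t 16 \
  -p "Write a haiku about VRAM."
```

The key knob is `-ngl`: raise it until VRAM is nearly full, since every layer moved off the CPU noticeably improves tokens per second.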