Comment by fancy_pantser a day ago
Sure. Someone on /r/LocalLLaMA was seeing 12.5 tokens/s on dual Strix Halo 128GB machines (which would run you $6-8K total) at 1.8 bits per parameter. At that quantization it performs far below the unquantized model, so it would not be my personal pick for a one-local-LLM-forever, but it is compelling because it has image and video understanding. You lose those features if you choose, say, gpt-oss-120B.
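For rough intuition on why that quant is needed to fit at all, here's a back-of-the-envelope memory estimate (a sketch: the ~1T parameter count is my assumption, not an official figure):

```python
# Back-of-the-envelope: does a ~1T-param model at 1.8 bits/param fit
# across 2x 128GB Strix Halo machines? (Parameter count is an assumption.)

TOTAL_PARAMS = 1.0e12   # assumed ~1T total parameters
BITS_PER_PARAM = 1.8    # the quant level mentioned above

weight_gb = TOTAL_PARAMS * BITS_PER_PARAM / 8 / 1e9

print(f"weights alone: ~{weight_gb:.0f} GB")  # ~225 GB
print(f"available:      {2 * 128} GB")        # 256 GB across both boxes
# Only ~30 GB left over for KV cache, activations, and the OS, which is
# part of why the 12.5 tok/s figure is quoted at empty context.
```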
Also, that figure is with an empty context, so generation would slow as the context fills (I don't think K2.5 uses the Kimi-Linear KDA attention mechanism, so its attention cost is sub-quadratic but not their lowest).
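To make the "slower as the context fills" point concrete, here's a toy throughput model (only the 12.5 tok/s empty-context figure comes from the report above; the slope constant C is purely illustrative): with standard attention, each decoded token re-reads the whole KV cache, so per-token time grows with context length, whereas a linear-attention scheme like KDA would keep that term roughly constant.

```python
# Toy model of decode throughput vs. context length: per-token time is a
# fixed compute term plus a KV-cache read term that grows with context.
# T0 comes from the quoted 12.5 tok/s at empty context; C is made up.

T0 = 1 / 12.5  # seconds/token at empty context (from the quoted figure)
C = 2e-6       # assumed extra seconds/token per token of context

def tok_per_s(n_ctx: int) -> float:
    """Estimated decode speed once n_ctx tokens are in context."""
    return 1.0 / (T0 + C * n_ctx)  # for linear attention, C would be ~0

for n in (0, 8_000, 32_000, 128_000):
    print(f"ctx={n:>7}: ~{tok_per_s(n):.1f} tok/s")
```

With these illustrative numbers, throughput falls from 12.5 tok/s at empty context to roughly 3 tok/s at 128K, which is the shape of the slowdown I'm gesturing at, not a measured curve.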