Comment by gedy 2 days ago


Sorry if this is an easily answerable question, but by "open" do they mean we can download this and use it totally offline, now or in the future once we have capable hardware? Seems like a great thing to archive in case the world falls apart (said half-jokingly)

fancy_pantser a day ago

Sure. Someone on /r/LocalLLaMA was seeing 12.5 tokens/s on dual Strix Halo 128GB machines (which would run you $6-8K total?) with 1.8 bits per parameter. At that quantization it performs far below the unquantized model, so it would not be my personal pick for a one-local-LLM-forever setup, but it is compelling because it has image and video understanding. You lose those features if you choose, say, gpt-oss-120B.
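
A quick sanity check on those numbers, as a minimal sketch. The ~1T-total / ~32B-active MoE shape and the ~256 GB/s per-box memory bandwidth are assumptions on my part, not figures from this thread:

    # Back-of-the-envelope check on the 12.5 tok/s figure.
    # Assumptions (mine, not from the thread): K2.5 keeps K2's MoE
    # shape (~1T total params, ~32B active per token), and each
    # Strix Halo box has ~256 GB/s of memory bandwidth.
    TOTAL_PARAMS  = 1.0e12
    ACTIVE_PARAMS = 32e9
    BITS          = 1.8
    BW            = 256e9  # bytes/s

    weights_gb = TOTAL_PARAMS * BITS / 8 / 1e9
    print(f"weights at {BITS} bpp: ~{weights_gb:.0f} GB")  # ~225 GB, hence two 128 GB boxes

    # Decode is roughly bandwidth-bound: each token streams the
    # active weights once, so an optimistic single-box ceiling is:
    tok_s = BW / (ACTIVE_PARAMS * BITS / 8)
    print(f"single-box ceiling: ~{tok_s:.0f} tok/s")  # ~36; 12.5 across a network looks plausible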

Also, that's with no context, so it would get slower as the context fills (I don't think K2.5 uses the Kimi-Linear KDA attention mechanism, so it's sub-quadratic but not their lowest).
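
To make the context slowdown concrete: each new token has to read back the whole KV cache, so per-token cost grows with context length. A rough sketch; the ~70 KB/token KV figure is an illustrative assumption, not a published K2.5 spec:

    # Each decoded token re-reads the entire KV cache, so per-token
    # cost grows linearly with context length (quadratic overall).
    KV_BYTES_PER_TOKEN = 70e3   # assumed KV-cache footprint per token
    BW = 256e9                  # assumed memory bandwidth, bytes/s

    for ctx in (0, 16_000, 64_000, 128_000):
        extra_ms = ctx * KV_BYTES_PER_TOKEN / BW * 1e3
        print(f"ctx={ctx:>7,}: +{extra_ms:5.1f} ms/token for KV cache reads")
    # 12.5 tok/s is ~80 ms/token at zero context, so tens of extra
    # ms per token at long context is a visible slowdown.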

Tepix 2 days ago

You could buy five Strix Halo systems at $2000 each, network them and run it.

Rough estimate: 12.5 / 2.2 ≈ 5.7, so you should get around 5.5 tokens/s.
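
One reading of that 2.2 factor (an assumption on my part; the comment doesn't spell it out): five 128 GB boxes give 640 GB, enough for a ~4-bit quant instead of the 1.8-bit one, and bandwidth-bound decode slows roughly in proportion to bytes per weight:

    # Assumed reading: 2.2 ~= 4 / 1.8, the extra bytes per weight
    # when moving from a 1.8-bit quant up to a ~4-bit one.
    baseline_tps = 12.5   # dual-machine figure quoted above
    factor = 4 / 1.8      # ~2.22x more data streamed per token
    print(f"~{baseline_tps / factor:.1f} tok/s")  # ~5.6, i.e. "around 5.5"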

  • j-bos 2 days ago

    Are the software and drivers for networking LLMs on Strix Halo there yet? I was under the impression a few weeks ago that they're at a veeeery early stage and terribly slow.

fragmede 2 days ago

Yes, but the hardware to run it decently is gonna cost you north of $100k, so hopefully you and your bunkermates allocated the right amount to this instead of guns or ammo.