Comment by Tepix 8 hours ago

Using lmstudio-community/Qwen3-Coder-Next-GGUF:Q8_0 I'm getting up to 32 tokens/s on Strix Halo, with room for 128k of context (out of 256k that the model can manage).

From very limited testing, it seems to be slightly worse than MiniMax M2.1 Q6 (a model about twice its size). I'm impressed.

dimgl 8 hours ago

How's the Strix Halo? I'd really like to get a local inference machine so that I don't have to use quantized versions of local models.

evilduck 6 hours ago

Works great for these types of MoE models. Having large amounts of VRAM lets you run different models in parallel easily, or use context sizes that are actually useful. Dense models can get sluggish, though. AMD's ROCm support has been a little rough for Stable Diffusion stuff (memory issues leading to application stability problems), but it's worked well with LLMs, as does Vulkan.
I wish AMD would get around to adding NPU support in Linux for it, though; it has more potential that could be unlocked.

Tepix 6 hours ago

Prompt preprocessing is slow, the rest is pretty great.

cmrdporcupine 8 hours ago

I'm getting similar numbers on NVIDIA Spark: around 25-30 tokens/sec output and 251 tokens/sec prompt processing... but I'm running with the Q4_K_XL quant. I'll try the Q8 next, but that would leave less room for context.

I tried FP8 in vLLM and it used 110GB and then my machine started to swap when I hit it with a query. Only room for 16k context.

I suspect there will be some optimizations over the next few weeks that will pick up the performance on these types of machines.

I have it writing some Rust code, and it's definitely slower than using a hosted model, but it actually seems pretty competent. These are the first results I've had on a locally hosted model that I could see myself actually using, though only once the speed picks up a bit.

I suspect the API providers will offer this model for nice and cheap, too.
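
To put the quoted rates in perspective, here is a minimal sketch that turns prompt-processing and output speeds into a rough wall-clock estimate. It assumes both rates stay constant for the whole request, which is optimistic at long context. The function name `estimate_seconds` and the request sizes (a 32,000-token prompt and a 1,000-token reply) are hypothetical; only the 251 tok/s and 25-30 tok/s figures come from the comment above.

```python
def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, decode_tok_s: float) -> float:
    """Rough wall-clock estimate: prompt processing time plus generation time."""
    if prefill_tok_s <= 0 or decode_tok_s <= 0:
        raise ValueError("rates must be positive")
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


if __name__ == "__main__":
    # Rates quoted in the comment: 251 tok/s prompt processing, 25-30 tok/s output.
    # The 32,000-token prompt and 1,000-token reply are hypothetical request sizes.
    for decode in (25.0, 30.0):
        secs = estimate_seconds(32_000, 1_000,
                                prefill_tok_s=251.0, decode_tok_s=decode)
        print(f"decode at {decode:.0f} tok/s: about {secs:.0f} s per request")
```

With a large prompt, prompt processing can account for most of the wait, so an agent that resends a long context on every turn would feel slower than the output rate alone suggests.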

aseipp 8 hours ago

llama.cpp is giving me ~35 tok/sec with the unsloth quants (UD-Q4_K_XL, elsewhere in this thread) on my Spark. FWIW, my understanding and experience is that llama.cpp seems to give slightly better performance for "single user" workloads, but I'm not sure why.
I'm asking it to do some analysis/explain some Rust code in a rather large open source project and it's working nicely. I agree this is a model I could possibly, maybe use locally...

- cmrdporcupine 7 hours ago
  
  Yeah, I got 35-39 tok/sec for one-shot prompts, but for real-world, longer-context interactions through opencode it seems to average out to 20-30 tok/sec. I tried both MXFP4 and Q4_K_XL; no big difference, unfortunately.
  The --no-mmap and --fa on options seemed to help, but not dramatically.
  As with everything Spark, memory bandwidth is the limitation.
  I'd like to be impressed with 30 tok/sec, but it's sort of a "leave it overnight and come back to the results" kind of experience; it wouldn't replace my normal agent use.
  However I suspect in a few days/weeks DeepInfra.com and others will have this model (maybe Groq, too?), and will serve it faster and for fairly cheap.
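
The memory-bandwidth point can be illustrated with a back-of-the-envelope ceiling: if every generated token has to stream the active weights from memory once, output speed cannot exceed bandwidth divided by bytes read per token. The sketch below assumes exactly that simplified model and ignores KV-cache reads, compute, and runtime overhead, so real throughput will be lower. The function name and all numbers in it are hypothetical placeholders, not the specifications of the Spark or of this model.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on tokens/sec when each token streams the active weights once.

    Simplified model: ignores KV-cache traffic, compute time, and overhead.
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Hypothetical placeholders: 100 GB/s of memory bandwidth,
    # 10B active parameters, 8 bits per weight.
    ceiling = decode_ceiling_tok_s(100.0, 10.0, 8.0)
    print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/s")
```

Under this model, halving the bits per weight doubles the ceiling, so a quant change that makes no measurable difference suggests something other than weight streaming is also holding the speed back.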
  