genpfault 9 hours ago

Nice! Getting ~39 tok/s @ ~60% GPU util. (~170W out of 303W per nvtop).

System info:

    $ ./llama-server --version
    ggml_vulkan: Found 1 Vulkan devices:
    ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    version: 7897 (3dd95914d)
    built with GNU 11.4.0 for Linux x86_64
llama.cpp command-line:

    $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on \
    --ctx-size 32768
  • halcyonblue 8 hours ago

    What am I missing here? I thought this model needs 46GB of unified memory for the 4-bit quant, and the Radeon RX 7900 XTX only has 24GB of memory, right? Hoping to get some insight, thanks in advance!

    • coder543 8 hours ago

      MoEs can be split efficiently between dense weights (attention, KV, etc.) and sparse expert (MoE) weights. By keeping the dense weights on the GPU and offloading the sparse weights to slower CPU RAM, you can still get surprisingly decent performance out of a lot of MoEs.

      Not as good as running the entire thing on the GPU, of course.
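
      In llama.cpp terms, that split is usually expressed by pinning the expert tensors to CPU - a rough sketch, not the exact command from above (flag availability and the tensor-name regex should be checked against your build and model):

          # offload everything to the GPU, then force the MoE expert
          # tensors (ffn_*_exps) back into system RAM
          $ ./llama-server -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
              --n-gpu-layers 99 \
              -ot ".ffn_.*_exps.=CPU"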

  • danielhanchen 2 hours ago

    Super cool! Also, with `--fit on` you technically don't need `--ctx-size 32768` anymore - llama-server will auto-determine the max context size!
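
    With that, the command above can be trimmed to (same build assumed; `--fit on` then picks the context size):

        $ ./llama-server --host 0.0.0.0 --port 2000 --no-warmup \
        -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
        --jinja --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --fit on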

bityard 8 hours ago

Hi Daniel, I've been using some of your models on my Framework Desktop at home. Thanks for all that you do.

Asking from a place of pure ignorance here, because I don't see the answer on HF or in your docs: Why would I (or anyone) want to run this instead of Qwen3's own GGUFs?

  • danielhanchen 2 hours ago

    Thanks! Oh, Qwen3's own GGUFs also work, but ours are dynamically quantized and calibrated with a reasonably large, diverse dataset, whilst Qwen's are not - see https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

    • bityard 2 hours ago

      I've read that page before and although it all certainly sounds very impressive, I'm not an AI researcher. What's the actual goal of dynamic quantization? Does it make the model more accurate? Faster? Smaller?

MrDrMcCoy 5 hours ago

Still hoping IQuest-Coder gets the same treatment :)

ranger_danger 11 hours ago

What is the difference between the UD and non-UD files?

  • danielhanchen 11 hours ago

    UD stands for "Unsloth-Dynamic", which upcasts important layers to higher bit widths. Non-UD files are standard llama.cpp quants. Both still use our calibration dataset.
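
    You can see the upcasting yourself - the gguf Python package from llama.cpp has a dump tool that prints each tensor's quant type (file name here is illustrative; output format may differ across versions):

        $ pip install gguf
        # tensors upcast above the base Q4_K show up as Q6_K / Q8_0
        $ gguf-dump Qwen3-Coder-Next-UD-Q4_K_XL.gguf | grep -Ei "q6_k|q8_0"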

    • CamperBob2 10 hours ago

      Please consider authoring a single, straightforward introductory-level page somewhere that explains what all the filename components mean, and who should use which variants.

      The green/yellow/red indicators for different levels of hardware support are really helpful, but far from enough IMO.

      • danielhanchen 10 hours ago

        Oh, good idea! UD-Q4_K_XL (Unsloth Dynamic 4-bit Extra Large) is what I generally recommend for most hardware - MXFP4_MOE is also OK.

      • segmondy 9 hours ago

        The green/yellow/red indicators are based on what you set for your hardware on Hugging Face.

CamperBob2 6 hours ago

Good results with your Q8_0 version on a 96GB RTX 6000 Blackwell. It one-shotted the Flappy Bird game and also wrote a good Wordle clone in four shots, all at over 60 tok/s. Thanks!

Is your Q8_0 file the same as the one hosted directly on the Qwen GGUF page?

  • danielhanchen 2 hours ago

    Nice! Yes, the Q8_0 is similar - the other quants differ since they use a calibration dataset.