Comment by simonw 8 hours ago

I got this running locally using llama.cpp from Homebrew and the Unsloth quantized model like this:

  brew upgrade llama.cpp # or brew install if you don't have it yet
Then:

  llama-cli \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja
That opened a CLI chat interface. For a web UI on port 8080, along with an OpenAI-compatible chat completions endpoint, run this instead:

  llama-server \
    -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL \
    --fit on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --jinja
It's using about 28GB of RAM.
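Once llama-server is up, anything that speaks the OpenAI chat completions API can point at it. A minimal sketch using only the Python standard library (the prompt and the "local" model name are placeholders; the network call is left commented out so it only fires once the server is actually listening on port 8080):

```python
import json
import urllib.request

# Build a standard OpenAI-style chat completions request; llama-server
# ignores the "model" field and answers with whatever model it loaded.
payload = {
    "model": "local",
    "messages": [{"role": "user", "content": "Write a haiku about llamas"}],
    "temperature": 1.0,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # llama-server default port
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once llama-server is running locally:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```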

nubg 6 hours ago

what's the tokens-per-second speed?

technotony 6 hours ago

what are your impressions?

  • simonw 4 hours ago

    I got Codex CLI running against it and was sadly very unimpressed - it got stuck in a loop running "ls" for some reason when I asked it to create a new file.

    • danielhanchen 2 hours ago

      Yes, sadly that sometimes happens - the issue is that Codex CLI / Claude Code were designed specifically for GPT / Claude models, so it's hard for OSS models to directly utilize the full spec / tools etc, and they can get stuck in loops sometimes - I would maybe try the MXFP4_MOE quant to see if it helps, and maybe try Qwen CLI (I was planning to make a guide for it as well)

      I guess once we see the day that OSS models truly utilize Codex / CC very well, local models will really take off
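
For reference, the quant swap danielhanchen suggests only requires changing the tag after the colon in the -hf argument; a sketch, assuming Unsloth publishes an MXFP4_MOE build under the same repo name (the other sampling flags from the earlier commands are omitted here for brevity):

```shell
# Hypothetical tag - check the unsloth/Qwen3-Coder-Next-GGUF repo on
# Hugging Face for the quants that actually exist before running this.
llama-server \
  -hf unsloth/Qwen3-Coder-Next-GGUF:MXFP4_MOE \
  --fit on \
  --jinja
```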