Comment by dust42
On a 64 GB MacBook Pro I use a Q4 quant of Qwen3-Coder-30B-A3B with llama.cpp.
For VSCode I use continue.dev, as it lets me set my own (short) system prompt. I get around 50 tokens/sec generation and 550 t/s prompt processing.
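For anyone who wants to replicate this: continue.dev just needs to point at llama-server's OpenAI-compatible endpoint, e.g. after starting it with something like llama-server -m <your-qwen3-coder-q4>.gguf --port 8080. A minimal sketch of what the editor does under the hood, assuming the default port; the model name, prompt and port here are placeholders, not my exact setup:

    # Sketch: querying a local llama-server via its OpenAI-compatible API.
    # Assumes llama-server is already running on port 8080 with the model loaded.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")
    resp = client.chat.completions.create(
        model="qwen3-coder-30b-a3b",  # with a single loaded model, the name is mostly cosmetic
        messages=[
            {"role": "system", "content": "You are a terse coding assistant."},  # the short custom system prompt
            {"role": "user", "content": "Write a Python function that reverses a string."},
        ],
    )
    print(resp.choices[0].message.content)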
Given small, well-defined tasks, it is as good as any frontier model.
I like the speed, the low latency, and the availability on a plane/train or off-grid.
FIM (fill-in-the-middle) completion with the llama.cpp VSCode plugin is also decent.
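For reference, the plugin drives llama-server's /infill endpoint for these completions. A sketch of the raw call, assuming a FIM-capable model is loaded and the server is on port 8080 (the code snippet contents are made up for illustration):

    # Sketch: fill-in-the-middle via llama-server's /infill endpoint,
    # which is what the VSCode plugin uses for inline completions.
    import requests

    resp = requests.post(
        "http://localhost:8080/infill",
        json={
            "input_prefix": "def add(a, b):\n    ",   # code before the cursor
            "input_suffix": "\n\nprint(add(2, 3))",   # code after the cursor
            "n_predict": 32,                          # cap the completion length
        },
    )
    print(resp.json()["content"])  # the suggested middle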
If I need more intelligence, my personal favourites are Claude and DeepSeek via API.
Would you use a different quant on a 128 GB machine? Could you link the specific download you used on Hugging Face? I find a lot of the options there confusing.