Comment by traktorn
Which model are you running locally? Is it faster than waiting for Claudes generation? What gear do you use?
Not OP but for autocomplete I am running Qwen2.5-Coder-7B and I quantized it using Q2_K. I followed this guide:
https://blog.steelph0enix.dev/posts/llama-cpp-guide/#quantiz...
And I get fast enough autocomplete results for it to be useful. I have an NVIDIA RTX 4060 in a laptop with 8 GB of dedicated memory that I use for it. I still use Claude for chat (pair programming) though, and I don't really use agents.
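If anyone wants to see what wiring that up looks like, here's a rough Python sketch that asks a local llama-server (the same binary the guide above has you build) for a fill-in-the-middle completion. The port, model filename, and the FIM/stop token names are my assumptions based on Qwen2.5-Coder's published special tokens, so adjust for your own setup:

    import requests

    # Sketch: ask a local llama-server for a fill-in-the-middle completion.
    # Assumes the server was started with something like
    #   llama-server -m qwen2.5-coder-7b-q2_k.gguf -ngl 99 --port 8080
    # The FIM/stop token names below follow Qwen2.5-Coder's special tokens.
    LLAMA_SERVER = "http://localhost:8080/v1/completions"

    def autocomplete(prefix: str, suffix: str, max_tokens: int = 64) -> str:
        # Build the fill-in-the-middle prompt around the cursor position.
        prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
        resp = requests.post(
            LLAMA_SERVER,
            json={
                "prompt": prompt,
                "max_tokens": max_tokens,
                "temperature": 0.2,
                "stop": ["<|fim_pad|>", "<|endoftext|>"],
            },
            timeout=10,
        )
        resp.raise_for_status()
        # OpenAI-compatible completions response: take the generated text.
        return resp.json()["choices"][0]["text"]

    if __name__ == "__main__":
        print(autocomplete("def fib(n):\n    ", "\n\nprint(fib(10))"))

An editor plugin is basically doing this on every pause in typing, which is why a small quant that fits entirely in 8 GB of VRAM matters more than raw quality here.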
That's the fun part, you can use all of them! And you don't need to use browser plugins or console scripts to auto-retry failures (there aren't any) or queue up a ton of tasks overnight.
I have a 3950X w/ 32GB RAM plus a Radeon VII & 6900XT sitting in the closet hosting smaller models, and a 5800X3D/128GB/7900XTX as my main machine.
Most any quantized model that fits in half the VRAM of a single GPU (and ideally supports flash attention, optionally with speculative decoding) will give you far faster autocompletes. This is especially true of the Radeon VII thanks to its memory bandwidth.
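If you want to sanity-check whether a particular combination (quant size, flash attention on/off, with or without a draft model for speculative decoding) is actually faster on your cards, a quick timing loop against the server is enough. This is just a sketch assuming llama-server's OpenAI-compatible /v1/completions endpoint on port 8080; the prompt and token counts are arbitrary:

    import time
    import requests

    # Quick-and-dirty throughput check against a local llama-server.
    # Run it once per server configuration (different quant, flash attention
    # toggled, draft model loaded or not) and compare the tok/s numbers.
    URL = "http://localhost:8080/v1/completions"
    PROMPT = "// write a function that reverses a linked list in C\n"

    def time_completion(max_tokens: int = 128) -> float:
        start = time.perf_counter()
        resp = requests.post(
            URL,
            json={"prompt": PROMPT, "max_tokens": max_tokens, "temperature": 0.0},
            timeout=60,
        )
        resp.raise_for_status()
        elapsed = time.perf_counter() - start
        # Fall back to max_tokens if the server omits usage accounting.
        tokens = resp.json().get("usage", {}).get("completion_tokens", max_tokens)
        print(f"{tokens} tokens in {elapsed:.2f}s ({tokens / elapsed:.1f} tok/s)")
        return elapsed

    if __name__ == "__main__":
        for _ in range(3):
            time_completion()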