Qwen3-Coder-Next
(qwen.ai)
583 points by danielhanchen 11 hours ago
It's interesting how people can be so into LLMs but don't, at the end of the day, understand that they're just passing "well-formatted" text to a text processor, and that everything else is built around encoding/decoding it into familiar or novel interfaces and the rest.
The instability of the tooling outside the LLM is what keeps me from building anything on the cloud: you're attaching your knowledge and workflow to a tool that can change dramatically based on context, cache, and model changes, and that can arbitrarily raise prices as "adaptable whales" push the cost up.
It's akin to learning everything about Beanie Babies in the early 1990s, and right when you think you understand the value proposition, suddenly they're all worthless.
That's why you can use the latest open coding models locally; they've reportedly reached the performance of Sonnet 4.5, so almost SOTA. And then you can use tricks like the one I mentioned above to directly manipulate GPU RAM for context cleanup when needed, which is not possible with cloud models unless the provider enables it.
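To make the GPU-RAM trick concrete: with a local runtime the KV cache is just tensors you own. A minimal sketch, assuming the transformers-style tuple cache layout (a real version would also need to fix up RoPE positions, and the shapes here are toy values):

    import torch

    def trim_cache(past_key_values, keep_last):
        # Each layer stores (key, value) tensors shaped
        # [batch, n_heads, seq_len, head_dim]. Keep only the last
        # keep_last positions; .contiguous() copies into fresh
        # storage so the old, larger cache can actually be freed.
        return tuple(
            (k[:, :, -keep_last:, :].contiguous(),
             v[:, :, -keep_last:, :].contiguous())
            for k, v in past_key_values
        )

    # Toy demo with random tensors standing in for a real cache.
    layer = (torch.randn(1, 8, 4096, 128), torch.randn(1, 8, 4096, 128))
    trimmed = trim_cache((layer,), keep_last=1024)
    print(trimmed[0][0].shape)  # torch.Size([1, 8, 1024, 128])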
how can anyone keep up with all these releases... what's next? Sonnet 5?
This is going to be a crazy month because the Chinese labs are all trying to get their releases out prior to their holidays (Lunar New Year / Spring Festival).
So we've seen a series of big ones already -- GLM 4.7 Flash, Kimi 2.5, StepFun 3.5, and now this. Still to come is likely a new DeepSeek model, which could be exciting.
And then I expect the Big3, OpenAI/Google/Anthropic will try to clog the airspace at the same time, to get in front of the potential competition.
Going to try this over Kimi k2.5 locally. It was nice but just a bit too slow and a resource hog.
Looks great - I'll try to check it out on my gaming PC.
On a misc note: What's being used to create the screen recordings? It looks so smooth!
It might be Screen Studio [0] -- I was gonna write "99% sure" but now I'm not sure at all!!
We are getting there. As a next step, please release something to outperform Opus 4.5 and GPT-5.2 in coding tasks.
By the time that happens, Opus 5 and GPT-5.5 will be out. At that point will a GPT-5.2 tier open-weights model feel "good enough"? Based on my experience with frontier models, once you get a taste of the latest and greatest it's very hard to go back to a less capable model, even if that less capable model would have been SOTA 9 months ago.
I think it depends on what you use it for. Coding, where time is money? You probably want the Good Shit, but also want decent open-weights models to keep prices sane rather than sama's 20k/month nonsense. Something like basic sentiment analysis? You can get good results out of a 30B MoE that runs at a good pace on a midrange laptop. Researching things online with many sources and decent results I'd expect to be doable locally by the end of 2026 if you have 128GB of RAM, although it'll take a while to resolve.
What does it mean for U.S. AI firms if the new equilibrium is devs running open models on local hardware?
OpenAI isn’t cornering the market on DRAM for kicks…
When Alibaba succeeds at producing a GPT-5.2-equivalent model, they won't be releasing the weights. They'll only offer API access, like for the previous models in the Qwen Max series.
Don't forget that they want to make money in the end. They release small models for free because the publicity is worth more than they could charge for them, but they won't just give away models that are good enough that people would pay significant amounts of money to use them.
More like open local models are becoming "good enough".
I got stuff done with Sonnet 3.7 just fine; it did need a bunch of babysitting, but it was still a net positive for productivity. Now local models are at that level, closing in on the current SOTA.
When "anyone" can run an Opus 4.5 level model at home, we're going to be getting diminishing returns from closed online-only models.
If an open weights model is released that’s as capable at coding as Opus 4.5, then there’s very little reason not to offload the actual writing of code to open weight subagents running locally and stick strictly to planning with Opus 5. Could get you masses more usage out of your plan (or cut down on API costs).
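A sketch of that split, assuming a local OpenAI-compatible server (which llama.cpp and vLLM both expose); the model names, port, and prompts are placeholders, not anyone's actual setup:

    from openai import OpenAI

    planner = OpenAI()  # cloud frontier model, reads OPENAI_API_KEY
    executor = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

    # The expensive model produces the plan...
    plan = planner.chat.completions.create(
        model="gpt-5.2",  # placeholder frontier model name
        messages=[{"role": "user",
                   "content": "Plan: add retry logic to fetch_users()"}],
    ).choices[0].message.content

    # ...the cheap local model writes the actual code.
    patch = executor.chat.completions.create(
        model="qwen3-coder-next",  # placeholder local model name
        messages=[{"role": "user",
                   "content": f"Implement this plan as a diff:\n{plan}"}],
    ).choices[0].message.content

(An Opus plan would go through Anthropic's own SDK instead; the shape of the split is the same.)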
I'm going in the opposite direction: with each new model, I try harder to optimize my existing workflows by breaking tasks down so that I can delegate them to the less powerful models, and only rely on the newer ones if the results are not acceptable.
> Based on my experience with frontier models, once you get a taste of the latest and greatest it's very hard to go back to a less capable model, even if that less capable model would have been SOTA 9 months ago.
That's the tyranny of comfort. Same for a high-end car, living in a big place, etc.
There's a good workaround, though: just don't try the luxury in the first place, so you can stay happy with the 9-month delay.
Still nothing to compete with GPT-OSS-20B for local inference with 16GB of VRAM.
Is there any online resource tracking local model capability on, say, a $2000 64GB Mac Mini? I'm getting increasingly excited about the local model space, because it offers a future where we can benefit from LLMs without having to listen to tech CEOs saber-rattle about ridding America of its jobs so they can get the next fundraising round sorted.
My IT department is convinced these "ChInEsE cCcP mOdElS" are going to exfiltrate our entire corporate network of its essential fluids and vita.. erh, I mean data. I've tried explaining to them that it's physically impossible for model weights to make network requests on their own. Also, what happened to their MitM-style, extremely intrusive network monitoring that they insisted we absolutely needed?
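To be fair to that point, the inertness of weights is easy to demonstrate: a .safetensors file is by design just named tensors with no executable code attached. A minimal sketch (the file path is a placeholder):

    from safetensors.torch import load_file

    # Loading yields a plain dict of tensors and nothing else; there
    # is no code path here that could open a socket.
    weights = load_file("model-00001-of-00002.safetensors")
    for name, tensor in list(weights.items())[:3]:
        print(name, tuple(tensor.shape), tensor.dtype)

The thing worth auditing is the tooling around the model (inference servers, agent tool use), not the weights file itself.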
The agent orchestration point from vessenes is interesting - using faster, smaller models for routine tasks while reserving frontier models for complex reasoning.
In practice, I've found the economics work like this:
1. Code generation (boilerplate, tests, migrations) - smaller models are fine, and latency matters more than peak capability.
2. Architecture decisions, debugging subtle issues - worth the cost of frontier models.
3. Refactoring existing code - the model needs to "understand" before changing, so context and reasoning matter more.
The 3B active parameters claim is the key unlock here. If this actually runs well on consumer hardware with reasonable context windows, it becomes the obvious choice for category 1 tasks. The question is whether the SWE-Bench numbers hold up for real-world "agent turn" scenarios where you're doing hundreds of small operations.
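Rough math on why a low active-parameter count is the unlock for local speed; a back-of-envelope sketch where the bandwidth and quantization figures are assumptions, not measurements:

    # Decode is roughly memory-bandwidth-bound: each generated token
    # must stream the active weights through the memory bus.
    active_params = 3e9      # active parameters per token (MoE routing)
    bytes_per_param = 0.5    # assumed ~4-bit quantization
    mem_bandwidth = 400e9    # bytes/s, illustrative consumer-GPU figure

    bytes_per_token = active_params * bytes_per_param  # 1.5 GB/token
    print(f"~{mem_bandwidth / bytes_per_token:.0f} tokens/s upper bound")  # ~267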
I find it really surprising that you're fine with low-end models for coding - I went through a lot of open-weights models, local and "local", and I consistently found the results underwhelming. GLM-4.7 was the smallest model I found to be somewhat reliable, but that's a sizable 350B and stretches the definition of local-as-in-at-home.
If it weren't for the single em-dash (really an en-dash, used as if it were an em-dash), how am I supposed to know that?
And at the end of the day, does it matter?
Does Qwen3 allow adjusting context during an LLM call, or does the housekeeping need to be done before/after each call, but not while a single LLM call with multiple tool calls is in progress?
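For context on what that housekeeping looks like from the API side: with a stateless chat API, each request resends the message list, so pruning naturally happens between requests in the agent loop. A hedged sketch (run_tool is a hypothetical dispatcher, and the truncation is deliberately naive):

    def agent_loop(client, model, messages, tools, max_msgs=40):
        while True:
            resp = client.chat.completions.create(
                model=model, messages=messages, tools=tools,
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                return msg.content
            for call in msg.tool_calls:
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": run_tool(call),  # hypothetical dispatcher
                })
            # Housekeeping point: between requests, drop old turns
            # (keeping the system prompt). Real compaction must keep
            # tool_call/result pairs together.
            if len(messages) > max_msgs:
                messages = messages[:1] + messages[-(max_msgs - 1):]

Doing it mid-call, while a single request's tool sequence is in flight, would need runtime-level access to the KV cache, which is exactly the local-model advantage discussed upthread.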