DeepSeek-v3.2: Pushing the frontier of open large language models [pdf]
(huggingface.co)
962 points by pretext 2 days ago
Funny to see tau2-bench on the list of benchmarks, when tau2-bench is flawed and a 100% score is impossible unless you add the tasks to the training set: https://github.com/sierra-research/tau2-bench/issues/89
I used DeepSeek-v3.2 to solve two coding problems by pasting code and directions as one large prompt into a chat interface and it performed very well. VERY WELL!
I am still happy to pay Google because of their ecosystem: the Gemini app, NotebookLM, Colab, gemini-cli, etc. Google’s moat for me is all the tooling and engineering around the models.
That said, my one-year Google AI subscription ends in four months and I might try an alternative, or at least evaluate options. Alibaba Cloud looks like an interesting low-cost alternative to AWS for building systems. I am now a retired ‘gentleman scientist’ and my personal research is inexpensive no matter who I pay for inference compute, but it is fun to spend a small amount of time evaluating alternatives even though mostly using Google is time efficient.
What version is actually running on chat.deepseek.com?
It refuses to tell me when asked, only that it's been trained on data up until July 2024, which would make it quite old. I turned off search and asked it for the winner of the 2024 US election, and it said it didn't know, so I guess that confirms it's not a recent model.
You can read that 3.2 is live on web and app here: https://api-docs.deepseek.com/news/news251201
The pdf describes how they did "continued pre-training" and then post-training to make 3.2. I guess what's missing is the full pre-training that absorbs most date-sensitive knowledge. That's probably also the reason the versions are still 3.x.
The AI market is hard to predict due to the constant development of new algorithms that could emerge unexpectedly. Refer to this summary of Ilya's opinions for insights into the necessity of these new algorithms: https://youtu.be/DcrXHTOxi3I
DeepSeek is a valuable product, but its open-source nature makes it difficult to displace larger competitors. Any advancements can be quickly adopted, and in fact, it may inadvertently strengthen these companies by highlighting weaknesses in their current strategies.
I genuinely do not understand the valuations of the US AI industry. The Chinese models are so close and far cheaper
Two aspects to consider:
1. Chinese models typically focus on text. US and EU models also bear the cross of handling images, and often voice and video. Supporting all of those means additional training cost not spent on further reasoning: tying one hand behind your back to be more generally useful.
2. The gap seems small, because so many benchmarks get saturated so fast. But towards the top, every 1% increase in benchmarks is significantly better.
On the second point, I worked on a leaderboard that both normalizes scores, and predicts unknown scores to help improve comparisons between models on various criteria: https://metabench.organisons.com/
You can notice that, while Chinese models are quite good, the gap to the top is still significant.
However, the US models are typically much more expensive for inference, and Chinese models do have a niche on the Pareto frontier on cheaper but serviceable models (even though US models also eat up the frontier there).
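For concreteness: the per-benchmark normalization such a leaderboard needs can be as simple as min-max scaling each benchmark column across models. This is a minimal NumPy sketch under my own assumptions; I don't know the site's actual method, and predicting unknown scores would need something extra (e.g. low-rank matrix completion over the score matrix):

```python
import numpy as np

def normalize_scores(scores):
    """Min-max scale each benchmark (column) to [0, 1] across models (rows),
    ignoring missing entries (NaN), so a saturated benchmark where everyone
    scores 90-95% spreads out as much as a hard one with 30-80% scores."""
    scores = np.asarray(scores, dtype=float)
    lo = np.nanmin(scores, axis=0)
    hi = np.nanmax(scores, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid divide-by-zero
    return (scores - lo) / span

# Two models on two benchmarks: one wide-open (50 vs 100), one saturated (90 vs 95).
out = normalize_scores([[50.0, 90.0], [100.0, 95.0]])
print(out)  # both gaps map to 0 -> 1
```

After normalization a 5-point gap on a saturated benchmark counts as much as a 50-point gap on an unsaturated one, which is exactly the "every 1% near the top matters more" effect described above.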
Nothing you said helps with the issue of valuation. Yes, the US models may be better by a few percentage points, but how can they justify being so costly, both operationally as well as in investment costs? Over the long run, this is a business and you don't make money being the first, you have to be more profitable overall.
I think the investment race here is an "all-pay auction". Lots of investors have looked at the ultimate prize (basically winning something larger than the entire present world economy, forever) and think "yes".
But even assuming that we're on the right path for that (which we may not be) and assuming that nothing intervenes to stop it (which it might), there may be only one winner, and that winner may not have even entered the game yet.
Qwen Image and Image Edit were among the best image models until Nano Banana Pro came along. I have tried some open image models and can confirm, the Chinese models are easily the best or very close to the best, but right now the Google model is even better... we'll see if the Chinese catch up again.
> Chinese models typically focus on text
Not true at all. Qwen has a VLM (qwen2 vl instruct) which is the backbone of Bytedance’s TARS computer use model. Both Alibaba (Qwen) and Bytedance are Chinese.
Also DeepSeek got a ton of attention with their OCR paper a month ago which was an explicit example of using images rather than text.
Qwen, Hunyuan, and WAN are three of the major competitors in the vision, text-to-image, and image-to-video spaces. They are quite competitive. Right now WAN is only behind Google's Veo in image-to-video rankings on lmarena, for example.
Thanks for sharing that!
The scales are a bit murky here, but if we look at the 'Coding' metric, we see that Kimi K2 outperforms Sonnet 4.5 - that's considered to be the price-perf darling I think even today?
I haven't tried these models, but in general there have been lots of cases where a model performs much worse IRL than the benchmarks would suggest (certain Chinese models and GPT-OSS have been guilty of this in the past)
Good question. There are two points to consider.
• For both Kimi K2 and for Sonnet, there's a non-thinking and a thinking version. Sonnet 4.5 Thinking is better than Kimi K2 non-thinking, but the K2 Thinking model came out recently, and beats it on all comparable pure-coding benchmarks I know: OJ-Bench (Sonnet: 30.4% < K2: 48.7%), LiveCodeBench (Sonnet: 64% < K2: 83%), they tie at SciCode at 44.8%. It is a finding shared by ArtificialAnalysis: https://artificialanalysis.ai/models/capabilities/coding
• The reason developers love Sonnet 4.5 for coding, though, is not just the quality of the code. They use Cursor, Claude Code, or some other system such as Github Copilot, which are increasingly agentic. On the Agentic Coding criteria, Sonnet 4.5 Thinking is much higher.
By the way, you can look at the Table tab to see all known and predicted results on benchmarks.
Yes, extremely likely they are prone to censorship based on the training. Try running them locally with something like LM Studio and asking questions the government is uncomfortable about. I originally thought the bias was in the GUI, but it's baked into the model itself.
It's all about the hardware and infrastructure. If you check OpenRouter, no provider offers a SOTA Chinese model matching the speed of Claude, GPT or Gemini. The Chinese models may benchmark close on paper, but real-world deployment is different. So you either buy your own hardware in order to run a Chinese model at 150-200 tps, or give up and use one of the Big 3.
The US labs aren't just selling models, they're selling globally distributed, low-latency infrastructure at massive scale. That's what justifies the valuation gap.
Edit: It looks like Cerebras is offering a very fast GLM 4.6
Gemini 3 = ~70tps https://openrouter.ai/google/gemini-3-pro-preview
Opus 4.5 = ~60-80tps https://openrouter.ai/anthropic/claude-opus-4.5
Kimi-k2-think = ~60-180tps https://openrouter.ai/moonshotai/kimi-k2-thinking
Deepseek-v3.2 = ~30-110tps (only 2 providers rn) https://openrouter.ai/deepseek/deepseek-v3.2
It doesn't work like that. You need to actually use the model and then go to /activity to see the actual speed. I constantly get 150-200tps from the Big 3 while other providers barely hit 50tps even though they advertise much higher speeds. GLM 4.6 via Cerebras is the only one faster than the closed source models at over 600tps.
These aren't advertised speeds, they are the average measured speeds by openrouter across different providers.
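For anyone who wants to verify a provider themselves rather than trust either number: decode throughput is easy to measure client-side from any streaming response. A minimal sketch (here `stream` is assumed to be any iterable of tokens/chunks from your client library; time-to-first-token is deliberately excluded, which matches how tps dashboards usually report decode speed):

```python
import time

def tokens_per_second(stream):
    """Measure decode throughput over a token stream, excluding the wait
    for the first token (TTFT), since dashboards report decode speed."""
    it = iter(stream)
    next(it)                      # block until the first token arrives
    t0 = time.perf_counter()
    n = 0
    for _ in it:                  # count the remaining tokens as they stream
        n += 1
    elapsed = time.perf_counter() - t0
    return n / elapsed if elapsed > 0 else float("inf")
```

Run it a few times at different hours; provider speeds vary a lot with load, which is one reason per-request /activity numbers and fleet-wide averages disagree.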
The network effects of using consistently behaving models and maintaining API coverage between updates is valuable, too - presumably the big labs are including their own domains of competence in the training, so Claude is likely to remain being very good at coding, and behave in similar ways, informed and constrained by their prompt frameworks, so that interactions will continue to work in predictable ways even after major new releases occur, and upgrades can be clean.
It'll probably be a few years before all that stuff becomes as smooth as people need, but OAI and Anthropic are already doing a good job on that front.
Each new Chinese model requires a lot of testing and bespoke conformance to every task you want to use it for. There's a lot of activity and shared prompt engineering, and some really competent people doing things out in the open, but it's generally going to take a lot more expert work getting the new Chinese models up to snuff than working with the big US labs. Their product and testing teams do a lot of valuable work.
Qwen 3 Coder Plus has been braindead this past weekend, but Codex 5.1 has also been acting up. It told me updating UI styling was too much work and I should do it myself. I also see people complaining about Claude every week. I think this is an unsolved problem, and you also have to separate perception from actual performance, which I think is an impossible task.
> If you check OpenRouter, no provider offers a SOTA chinese model matching the speed of Claude, GPT or Gemini.
I think GLM 4.6 offered by Cerebras is much faster than any US model.
cerebras AI offers models at 50x the speed of sonnet?
Valuation is not based on what they have done but what they might do, I agree tho it's investment made with very little insight into Chinese research. I guess it's counting on deepseek being banned and all computers in America refusing to run open software by the year 2030 /snark
>I guess it's counting on deepseek being banned
And the people making the bets are in a position to make sure the banning happens. The US government system being what it is.
Not that our leaders need any incentive to ban Chinese tech in this space. Just pointing out that it's not necessarily a "bet".
"Bet" implies you don't know the outcome and have no influence over it. Even "investment" implies you don't know the outcome. I'm not sure that's the case with these people?
Yet tbh, if the US industry had not moved ahead and created the race with FOMO, it would not have been as easy for the Chinese strategy to work either.
The nature of the race may yet change though, and I am unsure if the devil is in the details, as in very specific edge cases that will only work with frontier models?
I would expect one of the motivations for making these LLM model weights open is to undermine the valuation of other players in the industry. Open models like this must diminish the value prop of the frontier focused companies if other companies can compete with similar results at competitive prices.
There is a great deal of orientalism --- it is genuinely unthinkable to a lot of American tech dullards that the Chinese could be better at anything requiring what they think of as "intelligence." Aren't they Communist? Backward? Don't they eat weird stuff at wet markets?
It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed. Even now, when you ask questions like what you ask of that era, the answers you get are genuinely not better than "yes, this should have been obvious at the time if you were not completely blinded by ethnic and especially ideological prejudice."
Back when deepseek came out and people were tripping over themselves shouting it was so much better than what was out there, it just wasn’t good.
It might be this model is super good, I haven’t tried it, but to say the Chinese models are better is just not true.
What I really love though is that I can run them (open models) on my own machine. The other day I categorised images locally using Qwen, what a time to be alive.
Further even than local hardware, open models make it possible to run on providers of choice, such as European ones. Which is great!
So I love everything about the competitive nature of this.
If you thought DeepSeek "just wasn't good," there's a good chance you were running it wrong.
For instance, a lot of people thought they were running "DeepSeek" when they were really running some random distillation on ollama.
Early stages of Barbarossa were very successful and much of the Soviet Air Force, which had been forward positioned for invasion, was destroyed. Given the Red Army’s attitude toward consent, I would keep the praise carefully measured. TV has taught us there are good guys and bad guys when the reality is closer to just bad guys and bad guys
I don't think that anyone, much less someone working in tech or engineering in 2025, could still hold beliefs about Chinese not being capable scientists or engineers. I could maybe give (the naive) pass to someone in 1990 thinking China will never build more than junk. But in 2025 their product capacity, scientific advancement, and just the amount of us who have worked with extremely talented Chinese colleagues should dispel those notions. I think you are jumping to racism a bit fast here.
Germany was right in some ways and wrong in others about the Soviet Union's strength. The USSR failed to conquer Finland because of the military purges. German intelligence vastly under-estimated the number of tanks and the general preparedness of the Soviet army (Hitler was shocked the Soviets already had 40k tanks). Lend-Lease really sent an astronomical amount of goods to the USSR, which allowed them to fully commit to the war and focus on increasing their weapon production; the numbers on the tractors, food, trains, ammunition, etc. that the US sent to the USSR are staggering.
I don't think anyone seriously believes that the Chinese aren't capable, it's more like people believe no matter what happens, USA will still dominate in "high tech" fields. A variant of "American Exceptionalism" so to speak.
This is kinda reflected in the stock market, where the AI stocks are surging to new heights every day, yet their Chinese equivalents are relatively lagging behind in stock price, which suggests that investors are betting heavily on the US companies to "win" this "AI race" (if there's any gains to be made by winning).
Also, in the past couple years (or maybe a couple decades), there had also been a lot of crap talk about how China has to democratize and free up their markets in order to be competitive with the other first world countries, together with a bunch of "doomsday" predictions for authoritarianism in China. This narrative has completely lost any credibility, but the sentiment dies slowly...
Not sure how the entire Nazi comparison plays out, but at the time there were good reasons to imagine the Soviets would fall apart (as they initially did)
Stalin had just finished purging his entire officer corps, which is not a good omen for war, and the USSR failed miserably against the Finns, who were not the strongest of nations, while Germany had just steamrolled France, a country that was much more impressive in WW1 than the Russians (who collapsed against Germany)
They did, but the goalposts keep moving, so to speak. We're approximately here : advanced semiconductors, artificial intelligence, reusable rockets, quantum computing, etc. Chinese will never catch up. /s
"It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; ..."
Ideology played a role, but the data they worked with was the Finnish war, which was disastrous for the Soviet side. Hitler later famously said it was all an intentional distraction to make them believe the Soviet army was worth nothing. (The real reasons were more complex, like the previous purges.)
> It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed
Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers with Russian Civil War experience) during the Moscow-trials purges, the Germans almost succeeded.
> Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers with Russian Civil War experience) during the Moscow-trials purges, the Germans almost succeeded.
There were many counter revolutionaries among the leadership, even those conducting the purges. Stalin was like "ah fuck we're hella compromised." Many revolutions fail in this step and often end up facing a CIA backed coup. The USSR was under constant siege and attempted infiltration since inception.
> There were many counter revolutionaries among the leadership
Well, Stalin was, by far, the biggest counter-revolutionary in the Politburo.
> Stalin was like "ah fuck we're hella compromised."
There's no evidence that anything significant was compromised at that point, and clear evidence that Stalin was in fact medically paranoid.
> Many revolutions fail in this step and often end up facing a CIA backed coup. The USSR was under constant siege and attempted infiltration since inception.
Can we please not recycle 90-year-old Soviet propaganda? The Moscow trials being irrational self-harm was acknowledged by the USSR leadership as early as the fifties…
I am waiting for the first truly open model without any of the censorship built in.
I wonder how long it will take and how quickly it will try to get shut down.
I hate that their model ids don't change as they change the underlying model. I'm not sure how you can build on that.
% curl https://api.deepseek.com/models \
-H "Authorization: Bearer ${DEEPSEEK_API_KEY}"
{"object":"list","data":[{"id":"deepseek-chat","object":"model","owned_by":"deepseek"},{"id":"deepseek-reasoner","object":"model","owned_by":"deepseek"}]}
For just the model itself: 4 x params at F32, 2 x params at F16/BF16, or 1 x params at F8, e.g. 685GB at F8. It will be smaller for quantizations, but I'm not sure how to estimate those.
For a Mixture of Experts (MoE) model you only need to have the memory size of a given expert. There will be some swapping out as it figures out which expert to use, or to change expert, but once that expert is loaded it won't be swapping memory to perform the calculations.
You'll also need space for the context window; I'm not sure how to calculate that either.
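The back-of-the-envelope math above as a sketch. The KV-cache formula assumes plain multi-head/GQA attention and is only illustrative; DeepSeek's MLA (and now sparse attention) compresses the cache well below this:

```python
def weight_memory_gb(params_b: float, bytes_per_param: float) -> float:
    """Weight-only memory: parameter count times bytes per parameter."""
    return params_b * 1e9 * bytes_per_param / 1e9

# ~685B total parameters, as in DeepSeek-V3.x:
for label, bpp in [("F32", 4), ("F16/BF16", 2), ("F8", 1)]:
    print(f"{label}: {weight_memory_gb(685, bpp):.0f} GB")  # 2740 / 1370 / 685

def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per: int = 2) -> float:
    """Naive KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes,
    per token of context. Hypothetical shape numbers go in as arguments."""
    return 2 * layers * kv_heads * head_dim * bytes_per * tokens / 1e9
```

This also shows why quantization sizes are hard to eyeball: mixed schemes keep some tensors at higher precision, so the total is not a clean multiple of the parameter count.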
I think your understanding of MoE is wrong. Depending on the settings, each token can actually be routed to multiple experts, called experts choice architecture. This makes it easier to parallelize the inference (each expert on a different device for example), but it's not simply just keeping one expert in memory.
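To make the correction concrete, here is what token-choice top-k routing looks like (a toy NumPy sketch, not DeepSeek's actual router; V3-class models reportedly route each token to 8 of 256 fine-grained experts, so several experts must be resident at once):

```python
import numpy as np

def topk_route(router_logits, k=2):
    """Token-choice top-k routing: each token picks its k highest-scoring
    experts and softmax-normalizes their gate weights. All k chosen experts
    run for that token, so 'one expert in memory' is not enough."""
    topk = np.argsort(router_logits, axis=-1)[..., -k:]        # expert ids
    gates = np.take_along_axis(router_logits, topk, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))  # stable softmax
    return topk, gates / gates.sum(axis=-1, keepdims=True)

# One token scored against 8 experts: experts 1 and 4 are both activated.
ids, w = topk_route(np.array([0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5]), k=2)
```

And because consecutive tokens usually hit different experts, practical inference keeps all (or most) experts loaded, or streams them in: MoE mainly saves compute per token, not total memory.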
3.2-Exp came out in September: this is 3.2, along with a special checkpoint (DeepSeek-V3.2-Speciale) for deep reasoning that they're claiming surpasses GPT-5 and matches Gemini 3.0
From some very brief experimentation with DeepSeek about 2 months ago, tool calling is very hit or miss. Claude appears to be the absolute best.
Disclaimer: I did not test this yet.
I don't want to make big generalizations. But one thing I noticed with Chinese models, especially Kimi, is that they do very well on benchmarks but fail on vibe testing. It feels a little over-fitted to the benchmarks and less to the use cases.
I hope it's not the same here.
This was a bad problem with earlier Chinese (Qwen and Kimi K1 in particular) models, but the original DeepSeek delivered and GLM4.6 delivers. They don't diversify training as much as American labs so you'll find more edge cases and the interaction experience isn't quite as smooth, but the models put in work.
This used to happen with benchmarks on phones; manufacturers would tweak Android so benchmarks ran faster.
I guess that’s kinda how it is for any system that’s trained to do well on benchmarks: it does well on them but is rubbish at everything else.
My experience with deepseek and Kimi is quite the opposite: smarter than benchmarks would imply
Whereas the benchmark gains seen in new OpenAI, Grok and Claude models don't feel accompanied by vibe improvements
I would assume that it is testing how well and appropriately the LLM responds to prompts.
They've developed a sparse attention mechanism (which they document and release source code for) to increase model efficiency with long context, as needed for fast & cost-effective extensive RL training for reasoning and agentic use
They've built a "stable & scalable" RL protocol - more capable RL training infrastructure
They've built a pipeline/process to generate synthetic data for reasoning and agentic training
These all combine to build an efficient model with extensive RL post-training for reasoning and agentic use, although they note work is still needed on both the base model (more knowledge) and post-training to match frontier performance.
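The sparse-attention idea in the first point can be sketched as top-k key selection (illustrative NumPy only; per the paper, DSA picks keys with a small learned "lightning indexer" rather than the raw dot-product scores used here):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """One query attends only to its k best-scoring keys instead of all L,
    cutting per-query attention cost from O(L) to roughly O(k)."""
    scores = K @ q / np.sqrt(q.shape[-1])   # (L,) relevance of each key
    idx = np.argsort(scores)[-k:]           # keep the top-k key positions
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                            # softmax over the kept keys only
    return w @ V[idx]                       # weighted mix of kept values

out = topk_sparse_attention(np.ones(8), np.ones((4096, 8)), np.ones((4096, 8)), k=64)
```

The payoff is exactly what the comment describes: with long contexts, each decode step touches k keys instead of the whole 128K window, which is what makes extensive RL rollouts affordable.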
To push back on the naivety I'm sensing here: I think it's a little silly to see a Chinese Communist Party-backed enterprise as somehow magnanimous and without ulterior, very harmful motives.
Do you think it is from the goodness of their hearts that corporates support open source? E.g. Microsoft - VSCode and TypeScript, Meta - PyTorch and React, Google - Chromium and Go.
Yet, we (developers, users, human civilization), benefit from that.
So yes, I cherish when Chinese companies release open source LLMs. Be it as it fits their business model (the same way as US companies) or from grants (the same way as a lot of EU-backed projects, e.g. Python, DuckDB, scikit-learn).
Oh they need control of models to be able to censor and ensure whatever happens inside the country with AI stays under their control. But the open-source part? Idk I think they do it to mess with the US investment and for the typical open source reasons of companies: community, marketing, etc. But tbh especially the messing with the US, as a european with no serious competitor, I can get behind.
They're pouring money to disrupt American AI markets and efforts. They do this in countless other fields. It's a model of massive state funding -> give it away for cut-rate -> dominate the market -> reap the rewards.
It's a very transparent, consistent strategy.
AI is a little different because it has geopolitical implications.
When it's a competition among individual producers, we call it "a free market" and praise Hal Varian. When it's a competition among countries, it's suddenly threatening to "disrupt American AI markets and efforts". The obvious solution here is to pour money into LLM research too. Massive state funding -> provide SOTA models for free -> dominate the market -> reap the rewards (from the free models).
I can’t believe I’m shilling for China in these comments, but how different is it from company A getting blank-check investments from VCs and wink-wink support from the government in the West? And AI labs in China have been getting funding internally in the companies for a while now, since before the LLM era.
This is the rare earth minerals dumping all over again. Devalue to such a price as to make the market participants quit, so they can later have a strategic stranglehold on the supply.
This is using open source in a bit of different spirit than the hacker ethos, and I am not sure how I feel about it.
It is a kind of cheat on the fair market but at the same time it is also costly to China and its capital costs may become unsustainable before the last players fold.
> This is using open source in a bit of different spirit than the hacker ethos, and I am not sure how I feel about it.
It's a bit early to have any sort of feelings about it, isn't it? You're speaking in absolutes, but none of this is necessarily 100% true, as we don't know their intentions. And judging a group of individuals' intentions based on what their country seems to want, from the lens of a foreign country, usually doesn't land you on the right interpretation.
Where exactly is this fair market? Giant US companies love rules and regulations, but only when it benefits them (and they pay dearly for it)
Prosecutor, judge and jury? Do you have access to their minds to know their true intentions? This whole “DeepSeek is controlled by the CCP” thing is ridiculous. If you want to know how bad the CCP is at IT, check the government-backed banks.
The way I see this, some tech teams in China have figured out that training and tuning LLMs is not that expensive after all and they can do it at a fraction of the cost. So they are doing it to enter a market previously dominated by US only players.
Are you by chance an OpenAI investor?
We should all be happy about the price of AI coming down.
But the economy!!! /s
Seriously though, our leaders are actively throwing everything and the kitchen sink into AI companies, in some vain attempt to become immortal or to own even more of the nation's wealth beyond what they already do, chasing some kind of neo-tech feudalism. Both are unachievable because they rely on a complex system that they clearly don't understand.
Yeah and OpenAI's CPO was artificially commissioned as a Lt. Colonel in the US Army in conjunction with a $200M contract
Absurd to say Deepseek is CCP controlled while ignoring the govt connection here
Do they actually spend that much though? I think they are getting similar results with much fewer resources.
It's also a bit funny that providing free models is probably the most communist thing China has done in a long time.
Good luck making OpenAI and Google cry uncle. They have the US government on their side. They will not be allowed to fail, and they know it.
What I appreciate about the Chinese efforts is that they are being forced to get more intelligence from less hardware, and they are not only releasing their work products but documenting the R&D behind them at least as well as our own closed-source companies do.
A good reason to stir up dumping accusations and anti-China bias would be if they stopped publishing not just the open-source models, but the technical papers that go with them. Until that happens, I think it's better to prefer more charitable explanations for their posture.
Ah, so exactly like Uber, Netflix, Microsoft, Amazon, Facebook and so on have done to the rest of the world over the last few decades then?
Where do you think they learnt this trick? Years lurking on HN and this post's comment section wins #1 on the American Hypocrisy chart. Unbelievable that even in the current US people can't recognize when they're looking in the mirror. But I guess you're disincentivized to do so when most of your net worth stems from exactly those companies and those practices.
And the next question is what have they done with power historically, and what are they liable to do with it in the future. Limiting the scope to AI is shortsighted and doesn't speak to the concerns people have beyond an AI race.
It's a fair question, but my view of America's influence on world affairs has been dismal. China by contrast has not had a history of invading its neighbors, though I strongly criticize their involvement in the American attack on Cambodia and Vietnam (China supported the Khmer Rouge and briefly invaded Vietnam but was quickly pushed back, a reason Mao is sometimes criticized as having a good early period and a bad late period).
Meanwhile, America has been causing death and destruction around the world. It's easy to make lists: Vietnam, Iraq, Gaza, Cuba, South and Central America etc etc.
Pretty amazing that a relatively small Chinese hedge fund can build AI better than almost anyone.
At least, there is no doubt where he is from!
which version are you?
I am the latest version of the DeepSeek model! If you want to know the specific version number, I suggest you:
Check the official documentation - the DeepSeek website and docs have the most accurate version information
Follow official announcements - version updates are usually published through official channels
Check the app store / web version - the interface usually shows the current version
I have all of DeepSeek's latest features, including: strong conversation and reasoning capabilities
128K context length
File upload handling (images, documents, etc.)
Web search (must be enabled manually)
Completely free to use
If you need the exact version number for technical integration or other specific purposes, it's best to consult the official technical documentation, which will have the most accurate and detailed specifications. Is there anything else I can help you with?
It's so strange when it obviously hits a preprogrammed non-answer in these models; how can one ever trust them when there is a babysitter that interferes with an actual answer? I suppose that asking what version it is isn't a valid question in its training data, so it's programmed to say "check the documentation", but it's still definitely suspicious when it gives a non-answer.
Why are there so few 32, 64, 128, 256, or 512 GB models which could run on current consumer hardware? And why is the maximum RAM on a Mac Studio M4 128 GB??