Comment by jodleif 2 days ago
I genuinely do not understand the valuations of the US AI industry. The Chinese models are so close and far cheaper.
Nothing you said helps with the issue of valuation. Yes, the US models may be better by a few percentage points, but how can they justify costing so much more, both operationally and in investment? Over the long run, this is a business, and you don't make money by being first; you have to be more profitable overall.
I think the investment race here is an "all-pay auction"*. Lots of investors have looked at the ultimate prize — basically winning something larger than the entire present world economy forever — and think "yes".
But even assuming that we're on the right path for that (which we may not be) and assuming that nothing intervenes to stop it (which it might), there may be only one winner, and that winner may not have even entered the game yet.
> investors have looked at the ultimate prize — basically winning something larger than the entire present world economy
This is what people like Altman want investors to believe. It seems like any other snake oil scam because it doesn't match the reality of what he delivers.
Qwen Image and Image Edit were among the best image models until Nano Banana Pro came along. I have tried some open image models and can confirm the Chinese models are easily the best or very close to the best, but right now the Google model is even better... we'll see if the Chinese catch up again.
I'd say Google still hasn't caught up on the smaller model side at all, but we've all been (rightfully) wowed enough by Pro to ignore that for now.
Nano Banana Pro starts at 15 cents per image at <2k resolution, and is not strictly better than Seedream 4.0; yet the latter does 4K for 3 cents per image.
Add in the power of fine-tuning on their open weight models and I don't know if China actually needs to catch up.
I finetuned Qwen Image on 200 generations from Seedream 4.0 that were cleaned up with Nano Banana Pro, and got results that were as good and more reliable than either model could achieve otherwise.
> Chinese models typically focus on text
Not true at all. Qwen has a VLM (qwen2 vl instruct) which is the backbone of Bytedance’s TARS computer use model. Both Alibaba (Qwen) and Bytedance are Chinese.
Also DeepSeek got a ton of attention with their OCR paper a month ago which was an explicit example of using images rather than text.
Qwen, Hunyuan, and WAN are three of the major competitors in the vision, text-to-image, and image-to-video spaces. They are quite competitive. Right now WAN is only behind Google's Veo in image-to-video rankings on LMArena, for example.
Thanks for sharing that!
The scales are a bit murky here, but if we look at the 'Coding' metric, we see that Kimi K2 outperforms Sonnet 4.5, which is considered the price-performance darling, I think, even today?
I haven't tried these models, but in general there have been lots of cases where a model performs much worse IRL than the benchmarks would suggest (certain Chinese models and GPT-OSS have been guilty of this in the past).
Good question. There are two points to consider.
• For both Kimi K2 and for Sonnet, there's a non-thinking and a thinking version. Sonnet 4.5 Thinking is better than Kimi K2 non-thinking, but the K2 Thinking model came out recently, and beats it on all comparable pure-coding benchmarks I know: OJ-Bench (Sonnet: 30.4% < K2: 48.7%), LiveCodeBench (Sonnet: 64% < K2: 83%), they tie at SciCode at 44.8%. It is a finding shared by ArtificialAnalysis: https://artificialanalysis.ai/models/capabilities/coding
• The reason developers love Sonnet 4.5 for coding, though, is not just the quality of the code. They use Cursor, Claude Code, or some other system such as Github Copilot, which are increasingly agentic. On the Agentic Coding criteria, Sonnet 4.5 Thinking is much higher.
By the way, you can look at the Table tab to see all known and predicted results on benchmarks.
Yes, they are extremely likely to be prone to censorship based on the training. Try running them locally with something like LM Studio and ask questions the government is uncomfortable about. I originally thought the bias was in the GUI, but it's baked into the model itself.
It's all about the hardware and infrastructure. If you check OpenRouter, no provider offers a SOTA Chinese model matching the speed of Claude, GPT or Gemini. The Chinese models may benchmark close on paper, but real-world deployment is different. So you either buy your own hardware in order to run a Chinese model at 150-200tps or give up and use one of the Big 3.
The US labs aren't just selling models, they're selling globally distributed, low-latency infrastructure at massive scale. That's what justifies the valuation gap.
Edit: It looks like Cerebras is offering a very fast GLM 4.6
Gemini 3 = ~70tps https://openrouter.ai/google/gemini-3-pro-preview
Opus 4.5 = ~60-80tps https://openrouter.ai/anthropic/claude-opus-4.5
Kimi-k2-think = ~60-180tps https://openrouter.ai/moonshotai/kimi-k2-thinking
Deepseek-v3.2 = ~30-110tps (only 2 providers rn) https://openrouter.ai/deepseek/deepseek-v3.2
It doesn't work like that. You need to actually use the model and then go to /activity to see the actual speed. I constantly get 150-200tps from the Big 3 while other providers barely hit 50tps even though they advertise much higher speeds. GLM 4.6 via Cerebras is the only one faster than the closed source models at over 600tps.
These aren't advertised speeds, they are the average measured speeds by openrouter across different providers.
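For what it's worth, you can also time throughput yourself instead of relying on either advertised or aggregated figures. A minimal sketch of the arithmetic (the streaming call itself is left as a comment, since endpoints and SDKs vary):

```python
def tokens_per_second(token_count: int, elapsed_s: float) -> float:
    """Throughput in tokens per second; guards against a zero or negative timer."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return token_count / elapsed_s

# To measure a provider yourself: record time.monotonic() before the request,
# count tokens as streamed chunks arrive, and divide at the end, e.g.:
#   start = time.monotonic()
#   ... stream the response, summing n_tokens over chunks ...
#   print(tokens_per_second(n_tokens, time.monotonic() - start))
```

Measured this way on your own prompts, the number already folds in queueing and network overhead, which is what you actually experience.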
The network effects of consistently behaving models and maintained API coverage between updates are valuable, too. Presumably the big labs include their own domains of competence in the training, so Claude is likely to remain very good at coding and to behave in familiar ways, informed and constrained by their prompt frameworks, so that interactions keep working predictably even after major new releases, and upgrades can be clean.
It'll probably be a few years before all that stuff becomes as smooth as people need, but OAI and Anthropic are already doing a good job on that front.
Each new Chinese model requires a lot of testing and bespoke conformance to every task you want to use it for. There's a lot of activity and shared prompt engineering, and some really competent people doing things out in the open, but it's generally going to take a lot more expert work getting the new Chinese models up to snuff than working with the big US labs. Their product and testing teams do a lot of valuable work.
Qwen 3 Coder Plus has been braindead this past weekend, but Codex 5.1 has also been acting up. It told me updating UI styling was too much work and I should do it myself. I also see people complaining about Claude every week. I think this is an unsolved problem, and you also have to separate perception from actual performance, which I think is an impossible task.
> If you check OpenRouter, no provider offers a SOTA chinese model matching the speed of Claude, GPT or Gemini.
I think GLM 4.6 offered by Cerebras is much faster than any US model.
cerebras AI offers models at 50x the speed of sonnet?
Valuation is not based on what they have done but on what they might do. I agree, though, that it's an investment made with very little insight into Chinese research. I guess it's counting on DeepSeek being banned and all computers in America refusing to run open software by the year 2030 /snark
>I guess it's counting on deepseek being banned
And the people making the bets are in a position to make sure the banning happens. The US government system being what it is.
Not that our leaders need any incentive to ban Chinese tech in this space. Just pointing out that it's not necessarily a "bet".
"Bet" implies you don't know the outcome and have no influence over it. Even "investment" implies you don't know the outcome. I'm not sure that's the case with these people?
Yet, tbh, if the US industry had not moved ahead and created the race with FOMO, it would not have been as easy for the Chinese strategy to work either.
The nature of the race may yet change, though, and I am unsure whether the devil is in the details, as in very specific edge cases that will work only with frontier models?
I would expect one of the motivations for making these LLM model weights open is to undermine the valuation of other players in the industry. Open models like this must diminish the value prop of the frontier focused companies if other companies can compete with similar results at competitive prices.
There is a great deal of orientalism --- it is genuinely unthinkable to a lot of American tech dullards that the Chinese could be better at anything requiring what they think of as "intelligence." Aren't they Communist? Backward? Don't they eat weird stuff at wet markets?
It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed. Even now, when you ask questions like what you ask of that era, the answers you get are genuinely not better than "yes, this should have been obvious at the time if you were not completely blinded by ethnic and especially ideological prejudice."
Back when deepseek came out and people were tripping over themselves shouting it was so much better than what was out there, it just wasn’t good.
It might be this model is super good, I haven’t tried it, but to say the Chinese models are better is just not true.
What I really love though is that I can run them (open models) on my own machine. The other day I categorised images locally using Qwen, what a time to be alive.
Further even than local hardware, open models make it possible to run on providers of choice, such as European ones. Which is great!
So I love everything about the competitive nature of this.
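The local categorization workflow mentioned above can be sketched against an OpenAI-compatible local server such as LM Studio serving a Qwen VLM. The model name, label prompt, and endpoint below are illustrative assumptions, not canonical:

```python
import base64

def build_categorize_request(image_bytes: bytes, labels: list[str],
                             model: str = "qwen2-vl-7b-instruct") -> dict:
    """Build an OpenAI-style chat payload asking a local VLM to pick one label.
    Model name and prompt wording are illustrative, not canonical."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this image as exactly one of: " + ", ".join(labels)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

# POST this to a local server, e.g. LM Studio's default endpoint:
#   requests.post("http://localhost:1234/v1/chat/completions",
#                 json=build_categorize_request(open("photo.jpg", "rb").read(),
#                                               ["cat", "dog", "other"]))
```

Because the server speaks the OpenAI chat format, the same loop works unchanged if you later point it at a different local model.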
If you thought DeepSeek "just wasn't good," there's a good chance you were running it wrong.
For instance, a lot of people thought they were running "DeepSeek" when they were really running some random distillation on ollama.
Good point, I was assuming the GP was running local for some reason. Hard to argue when it's the official providers who are being compared.
I ran the 1.58-bit Unsloth quant locally at the time it came out, and even at such low precision, it was super rare for it to get something wrong that o1 and GPT4 got right. I have never actually used a hosted version of the full DS.
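For context on why a 1.58-bit quant is runnable at all, here is a rough back-of-the-envelope for weight storage only (ignoring activations and KV cache; real quants mix precisions per layer, so actual file sizes differ somewhat):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: params * bits, over 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# DeepSeek-V3/R1 has roughly 671B total parameters:
print(weight_memory_gb(671e9, 16))    # BF16: ~1342 GB
print(weight_memory_gb(671e9, 1.58))  # 1.58-bit: ~133 GB
```

That order-of-magnitude drop is what moves the model from a multi-node cluster down to a single beefy workstation.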
Early stages of Barbarossa were very successful and much of the Soviet Air Force, which had been forward positioned for invasion, was destroyed. Given the Red Army’s attitude toward consent, I would keep the praise carefully measured. TV has taught us there are good guys and bad guys when the reality is closer to just bad guys and bad guys
I don't think that anyone, much less someone working in tech or engineering in 2025, could still hold beliefs about Chinese not being capable scientists or engineers. I could maybe give (the naive) pass to someone in 1990 thinking China will never build more than junk. But in 2025 their product capacity, scientific advancement, and just the amount of us who have worked with extremely talented Chinese colleagues should dispel those notions. I think you are jumping to racism a bit fast here.
Germany was right in some ways and wrong in others about the Soviet Union's strength. The USSR failed to conquer Finland because of the military purges. German intelligence vastly underestimated the number of tanks and the general preparedness of the Soviet army (Hitler was shocked the Soviets already had 40k tanks). The Lend-Lease act sent an astronomical amount of goods to the USSR, which allowed them to fully commit to the war and focus on increasing their weapons production; the numbers on the tractors, food, trains, ammunition, etc. that the US sent to the USSR are staggering.
I don't think anyone seriously believes that the Chinese aren't capable, it's more like people believe no matter what happens, USA will still dominate in "high tech" fields. A variant of "American Exceptionalism" so to speak.
This is kinda reflected in the stock market, where the AI stocks are surging to new heights every day, yet their Chinese equivalents are relatively lagging behind in stock price, which suggests that investors are betting heavily on the US companies to "win" this "AI race" (if there's any gains to be made by winning).
Also, in the past couple of years (or maybe a couple of decades), there has been a lot of crap talk about how China has to democratize and free up its markets in order to be competitive with the other first world countries, together with a bunch of "doomsday" predictions for authoritarianism in China. This narrative has completely lost any credibility, but the sentiment dies slowly...
Not sure how the entire Nazi comparison plays out, but at the time there were good reasons to imagine the Soviets would fall apart (as they initially did).
Stalin had just finished purging his entire officer corps, which is not a good omen for war, and the USSR had failed miserably against the Finns, who were not the strongest of nations, while Germany had just steamrolled France, a country that was much more impressive in WW1 than the Russians (who collapsed against Germany).
They did, but the goalposts keep moving, so to speak. We're approximately here: advanced semiconductors, artificial intelligence, reusable rockets, quantum computing, etc. Chinese will never catch up. /s
"It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; ..."
Ideology played a role, but the data they worked with was the Finnish war, which was disastrous for the Soviet side. Hitler later famously said it was all an intentional distraction to make them believe the Soviet army was worth nothing. (The real reasons were more complex, such as the earlier purges.)
> It reminds me, in an encouraging way, of the way that German military planners regarded the Soviet Union in the lead-up to Operation Barbarossa. The Slavs are an obviously inferior race; their Bolshevism dooms them; we have the will to power; we will succeed
Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers who had Russian Civil War experience) during the Moscow trials purges, the Germans almost succeeded.
> Though, because Stalin had decimated the Red Army leadership (including most of the veteran officers who had Russian Civil War experience) during the Moscow trials purges, the Germans almost succeeded.
There were many counter revolutionaries among the leadership, even those conducting the purges. Stalin was like "ah fuck we're hella compromised." Many revolutions fail in this step and often end up facing a CIA backed coup. The USSR was under constant siege and attempted infiltration since inception.
> There were many counter revolutionaries among the leadership
Well, Stalin was, by far, the biggest counter-revolutionary in the Politburo.
> Stalin was like "ah fuck we're hella compromised."
There's no evidence that anything significant was compromised at that point, and clear evidence that Stalin was in fact medically paranoid.
> Many revolutions fail in this step and often end up facing a CIA backed coup. The USSR was under constant siege and attempted infiltration since inception.
Can we please not recycle 90-year-old Soviet propaganda? The Moscow trials being irrational self-harm was acknowledged by the USSR leadership as early as the fifties…
Two aspects to consider:
1. Chinese models typically focus on text. US and EU models also bear the cross of handling images, often voice and video. Supporting all of those adds training cost not spent on further reasoning, tying one hand behind your back in order to be more generally useful.
2. The gap seems small, because so many benchmarks get saturated so fast. But towards the top, every 1% increase in benchmarks is significantly better.
On the second point, I worked on a leaderboard that both normalizes scores, and predicts unknown scores to help improve comparisons between models on various criteria: https://metabench.organisons.com/
You can notice that, while Chinese models are quite good, the gap to the top is still significant.
However, the US models are typically much more expensive for inference, and Chinese models do have a niche on the Pareto frontier on cheaper but serviceable models (even though US models also eat up the frontier there).
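Normalization is doing real work in a leaderboard like that: raw benchmark scores live on different scales, so averaging them directly over-weights the easy ones. The site's exact method isn't stated; per-benchmark min-max scaling is one common choice, sketched here with made-up numbers:

```python
def normalize_scores(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize one benchmark's scores to [0, 1] so that
    benchmarks with different scales can be averaged together."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All models tied: no spread to normalize over.
        return {m: 0.0 for m in scores}
    return {m: (s - lo) / (hi - lo) for m, s in scores.items()}

# Illustrative only; these are not real benchmark results.
example = normalize_scores({"model-a": 30.4, "model-b": 48.7, "model-c": 44.8})
```

One caveat of min-max: the scale is defined by whichever models happen to be on the board, so a new worst or best entry shifts every other model's normalized score.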