Comment by timpera a day ago

Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.

Youden a day ago

They mentioned LMArena, you can get the results for that here: https://lmarena.ai/leaderboard/text

Mistral Large 3 is ranked 28th, behind all the other major SOTA models. The Elo gap between Mistral and the leader is only 73 points though (1418 vs. 1491). I *think* that means the difference is relatively small.

  • jampekka 21 hours ago

    1491 vs. 1418 Elo means the stronger model wins about 60% of the time.
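
    For the curious, here's a minimal sketch of that calculation, assuming LMArena uses the standard Elo expected-score formula with the usual 400-point logistic scale (the ratings are the ones quoted above):

    ```python
    # Expected score under the standard Elo model: the probability that
    # the player rated rating_a beats the player rated rating_b.
    def elo_win_prob(rating_a: float, rating_b: float) -> float:
        return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

    leader, mistral = 1491, 1418  # LMArena ratings quoted above
    p = elo_win_prob(leader, mistral)
    print(f"Leader expected to win about {p:.0%} of head-to-head votes")  # ~60%
    ```

    One caveat: Elo gaps translate to win probabilities on a log-odds scale, not to a "times better" multiplier.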

    • supermatt 21 hours ago

      Probably naive questions:

      Does that also mean that Gemini 3 (the top-ranked model) loses to Mistral Large 3 40% of the time?

      Does that make Gemini 1.5x better, or Mistral 2/3rds as good as Gemini, or can we not quantify the difference like that?

qznc a day ago

I guess that could be considered comparative advertising then, and companies generally try to avoid that kind of scrutiny.

constantcrying a day ago

The lack of the comparison (which absolutely was done) tells you exactly what you need to know.

  • bildung 21 hours ago

    I think people from the US often aren't aware how many companies from the EU simply won't risk losing their data to the providers you have in mind: OpenAI, Anthropic, and Google. They are simply not an option at all.

    The company I work for, for example, a mid-sized tech business, is currently investigating its local hosting options for LLMs. So Mistral will certainly be an option, alongside the Qwen family and DeepSeek.

    Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.

    • adam_patarino 19 hours ago

      We're seeing the same thing at many companies, even in the US. Exposing your entire codebase to an unreliable third party is not exactly SOC/ISO compliant. This is one of the core things that motivated us to develop cortex.build, so we could put the model on the developer's machine and completely isolate the code without complicated model deployments and maintenance.

    • BoorishBears 20 hours ago

      Mistral was founded by multiple former Meta engineers, no?

      Funded mostly by US VCs?

      Hosted primarily on Azure?

      Do you really have to go out of your way to start calling their competition "data leeches" for out-executing them?

      • bildung 27 minutes ago

        I didn't mean to imply "US bad, EU good." As such, this isn't about which passport the VCs have, but about local hosting and open-weight models. A closed model from a US company always comes with the risk of data exfiltration, either for training or thanks to the CLOUD Act etc. (i.e. industrial espionage).

        And personally I don't care at all about the performance delta - we are talking about a difference of 6 to at most 12 months here between closed-source SOTA and open-weight models.

      • troyvit 18 hours ago

        It's wayyyy too early in the game to say who is out-executing whom.

        I mean, why do you think those guys left Meta? It reminds me of a flight ten years ago when I sat next to a guy who worked in the natural gas industry. I was (cough, still am) a pretty naive environmentalist, so I asked him what he thought of solar, wind, etc., and why we should be investing in natural gas when there are all these other options. His response was simple: natural gas can serve as a bridge from hydrocarbons to true green energy sources. Leverage that dense energy to springboard the other sources in the mix and you build a path forward to carbon-free energy.

        I see Mistral's use of US VCs the same way. Those VCs are hedging their bets and maybe hoping to make a few bucks. A few of them are probably involved because they're buddies with the former Meta guys from "back in the day." If Mistral executes on their plan of being a transparent b2b option with solid data protections, then they used those VCs the way VCs deserve to be used, and the VCs make their few bucks. If Europe ever catches up to the US in terms of data centers, would Mistral move off of Azure? I'd bet $5 that they would.

      • sofixa 18 hours ago

        Mistral are mostly focusing on b2b, and on customers that want to self-host (banks and the like). So their founders being from Meta, or where their cloud platform is hosted, is entirely irrelevant to the story.

  • popinman322 a day ago

    They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.

    There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.

    A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.

    • kalkin 21 hours ago

      Scale AI wrote a paper a year ago comparing various models' performance on benchmarks to their performance on similar but held-out questions. Generally the closed-source models performed better, and Mistral came out looking pretty bad: https://arxiv.org/pdf/2405.00332

    • extr 21 hours ago

      ??? Closed US frontier models are vastly more effective than anything OSS right now. The reason they didn't compare is that they're in a different weight class (and therefore a different product), and it would be a bit unfair.

      We’re actually at a unique point right now where the gap is larger than it has been in some time. Consensus since the latest batch of releases is that we haven’t found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models and unless you have unique requirements some way down the price/perf curve I would not even look at this release (which is fine!)

  • tarruda a day ago

    Here's what I understood from the blog post:

    - Mistral Large 3 is comparable with the previous Deepseek release.

    - Ministral 3 LLMs are comparable with older open LLMs of similar sizes.

    • constantcrying a day ago

      And implicit in this is that it compares very poorly to SOTA models. Do you disagree with that? Do you think these models are beating SOTA and they did not include the benchmarks because they forgot?

      • saubeidl 21 hours ago

        Those are SOTA for open models. It's a separate league from closed models entirely.

        • supermatt 21 hours ago

          > It's a separate league from closed models entirely.

          To be fair, the SOTA models aren't even a single LLM these days. They are doing all manner of tool use and specialised submodel calls behind the scenes - a far cry from in-model MoE.

      • tarruda a day ago

        > Do you disagree with that?

        I think that Qwen3 8B and 4B are SOTA for their size. The GPQA Diamond accuracy chart is weird: both Qwen3 8B and 4B have higher scores, so they used this weird chart where the x-axis shows the number of output tokens. I don't see the point of this.

  • crimsoneer a day ago

    If someone is using these models, they probably can't or won't use the existing SOTA models, so I'm not sure how useful those comparisons actually are. "Here is a benchmark that makes us look bad, from a model you can't use, on a task you won't be undertaking" isn't actually helpful (and definitely not in a press release).

    • constantcrying a day ago

      I completely agree that there are legitimate reasons to prefer comparison to e.g. DeepSeek models. But that doesn't change my point: we both agree that the comparisons would be extremely unfavorable.

      • Lapel2742 a day ago

        > that the comparisons would be extremely unfavorable.

        Why should they compare apples to oranges? Mistral Large 3 costs ~1/10th of Sonnet 4.5. They clearly target different users. If you want a coding assistant, you probably wouldn't choose this model for various reasons. There is room for more than just the benchmark king.

rvz 21 hours ago

> I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release,

Why would they? They know they can't compete against the heavily funded closed-source models.

They are not even comparing against GPT-OSS.

That is absolutely and shockingly bearish.