Comment by timpera a day ago

Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.

Youden a day ago

They mentioned LMArena, you can get the results for that here: https://lmarena.ai/leaderboard/text

Mistral Large 3 is ranked 28th, behind all the other major SOTA models. The Elo gap between Mistral and the leader is only 73 points though (1418 vs. 1491). I *think* that means the difference is relatively small.

  • jampekka 21 hours ago

    1491 vs. 1418 Elo means the stronger model wins about 60% of the time.
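
    For the curious, here's a minimal sketch of that calculation, assuming LMArena uses the standard Elo expected-score formula with the usual 400-point logistic scale (the ratings are the ones quoted above):

    ```python
    # Expected score under the standard Elo model: the probability that
    # the player rated rating_a beats the player rated rating_b.
    def elo_win_prob(rating_a: float, rating_b: float) -> float:
        return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

    leader, mistral = 1491, 1418  # LMArena ratings quoted above
    p = elo_win_prob(leader, mistral)
    print(f"Leader expected to win about {p:.0%} of head-to-head votes")  # ~60%
    ```

    One caveat: Elo gaps translate to win probabilities on a log-odds scale, not to a "times better" multiplier.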

    • supermatt 21 hours ago

      Probably naive questions:

      Does that also mean that Gemini 3 (the top-ranked model) loses to Mistral Large 3 40% of the time?

      Does that make Gemini 1.5x better, or Mistral 2/3rds as good as Gemini, or can we not quantify the difference like that?

qznc a day ago

I guess that could be considered comparative advertising then, and companies generally try to avoid that kind of scrutiny.

constantcrying a day ago

The lack of the comparison (which absolutely was done) tells you exactly what you need to know.

  • bildung 21 hours ago

    I think people from the US often aren't aware how many companies from the EU simply won't risk losing their data to the providers you have in mind: OpenAI, Anthropic, and Google. They are simply not an option at all.

    The company I work for, for example, a mid-sized tech business, is currently investigating its local hosting options for LLMs. So Mistral will certainly be an option, alongside the Qwen family and DeepSeek.

    Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.

    • adam_patarino 19 hours ago

      We're seeing the same thing at many companies, even in the US. Exposing your entire codebase to an unreliable third party is not exactly SOC/ISO compliant. This is one of the core things that motivated us to develop cortex.build, so we could put the model on the developer's machine and completely isolate the code without complicated model deployments and maintenance.

    • BoorishBears 20 hours ago

      Mistral was founded by multiple former Meta engineers, no?

      Funded mostly by US VCs?

      Hosted primarily on Azure?

      Do you really have to go out of your way to start calling their competition "data leeches" for out-executing them?

      • bildung 27 minutes ago

        I didn't mean to imply "US bad, EU good." As such, this isn't about which passport the VCs have, but about local hosting and open-weight models. A closed model from a US company always comes with the risk of data exfiltration, either for training or thanks to the CLOUD Act etc. (i.e. industrial espionage).

        And personally I don't care at all about the performance delta - we are talking about a difference of 6 to at most 12 months here between closed-source SOTA and open-weight models.

      • troyvit 18 hours ago

        It's wayyyy too early in the game to say who is out-executing whom.

        I mean, why do you think those guys left Meta? It reminds me of a flight ten years ago when I sat next to a guy who worked in the natural gas industry. I was (cough, still am) a pretty naive environmentalist, so I asked him what he thought of solar, wind, etc., and why we should be investing in natural gas when there are all these other options. His response was simple: natural gas can serve as a bridge from hydrocarbons to true green energy sources. Leverage that dense energy to springboard the other sources in the mix and you build a path forward to carbon-free energy.

        I see Mistral's use of US VCs the same way. Those VCs are hedging their bets and maybe hoping to make a few bucks. A few of them are probably involved because they're buddies with the former Meta guys from "back in the day." If Mistral executes on their plan of being a transparent b2b option with solid data protections, then they used those VCs the way VCs deserve to be used, and the VCs make their few bucks. If Europe ever catches up to the US in terms of data centers, would Mistral move off of Azure? I'd bet $5 that they would.

      • sofixa 18 hours ago

        Mistral are mostly focusing on b2b, and on customers that want to self-host (banks and the like). So their founders being from Meta, or where their cloud platform is hosted, is entirely irrelevant to the story.

  • popinman322 a day ago

    They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.

    There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.

    A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.

    • kalkin 21 hours ago

      Scale AI wrote a paper a year ago comparing various models' performance on benchmarks to their performance on similar but held-out questions. Generally the closed-source models performed better, and Mistral came out looking pretty bad: https://arxiv.org/pdf/2405.00332

    • extr 21 hours ago

      ??? Closed US frontier models are vastly more effective than anything OSS right now. The reason they didn't compare is that they're in a different weight class (and therefore a different product), and it would be a bit unfair.

      We’re actually at a unique point right now where the gap is larger than it has been in some time. Consensus since the latest batch of releases is that we haven’t found the wall yet. 5.1 Max, Opus 4.5, and G3 are absolutely astounding models and unless you have unique requirements some way down the price/perf curve I would not even look at this release (which is fine!)

  • tarruda a day ago

    Here's what I understood from the blog post:

    - Mistral Large 3 is comparable with the previous Deepseek release.

    - Ministral 3 LLMs are comparable with older open LLMs of similar sizes.

    • constantcrying a day ago

      And implicit in this is that it compares very poorly to SOTA models. Do you disagree with that? Do you think these models are beating SOTA and they did not include the benchmarks because they forgot?

      • saubeidl 21 hours ago

        Those are SOTA for open models. It's a separate league from closed models entirely.

        • supermatt 21 hours ago

          > It's a separate league from closed models entirely.

          To be fair, the SOTA models aren't even a single LLM these days. They are doing all manner of tool use and specialised submodel calls behind the scenes - a far cry from in-model MoE.

      • tarruda a day ago

        > Do you disagree with that?

        I think that Qwen3 8B and 4B are SOTA for their size. The GPQA Diamond accuracy chart is weird: both Qwen3 8B and 4B have higher scores, so they used this weird chart where the x-axis shows the number of output tokens. I don't see the point of this.

  • crimsoneer a day ago

    If someone is using these models, they probably can't or won't use the existing SOTA models, so I'm not sure how useful those comparisons actually are. "Here is a benchmark that makes us look bad, from a model you can't use, on a task you won't be undertaking" isn't actually helpful (and definitely not in a press release).

    • constantcrying a day ago

      I completely agree that there are legitimate reasons to prefer comparison to e.g. DeepSeek models. But that doesn't change my point: we both agree that the comparisons would be extremely unfavorable.

      • Lapel2742 a day ago

        > that the comparisons would be extremely unfavorable.

        Why should they compare apples to oranges? Mistral Large 3 costs ~1/10th of Sonnet 4.5. They clearly target different users. If you want a coding assistant, you probably wouldn't choose this model for various reasons. There is room for more than just the benchmark king.

rvz 21 hours ago

> I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release,

Why would they? They know they can't compete against the heavily funded closed-source models.

They are not even comparing against GPT-OSS.

That is absolutely and shockingly bearish.