Comment by Zababa
>E.g. gemini-3-pro tops the lmarena text chart today at 1488 vs 1346 for gpt-4o-2024-05-13. That's a win rate of 70% (where 50% is equal chance of winning) over 1.5 years. Meanwhile, even the open weights stuff OpenAI gave away last summer scores between the two.
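(For anyone who wants to check the quoted arithmetic: under the standard Elo model, a 142-point rating gap does come out to roughly a 70% win rate. A quick sketch, assuming the usual logistic Elo formula rather than LMArena's exact Bradley-Terry fit:)

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected win probability of A over B under the standard Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# 1488 (gemini-3-pro) vs 1346 (gpt-4o-2024-05-13), per the quote
print(round(elo_win_prob(1488, 1346), 3))  # ≈ 0.694, i.e. roughly the 70% cited
```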
I think in that specific case this says more about LMArena than about the newer models. Remember that GPT 4o was so specifically loved by people that when GPT 5 replaced it, there was a lot of backlash against OpenAI.
One of the popular benchmarks right now is METR's, which shows some real improvement with newer models like Opus 4.5. Another way of getting data is anecdotes: lots of people are really impressed with Opus 4.5 and Codex 5.2 (though those are hard to disentangle from people getting better with the tools, the scaffolding (Claude Code, Codex) getting better, and lots of other stuff). SWE-bench is still not saturated (less than 75%, I think).