Comment by zamadatix 4 days ago
It'd certainly be odd if people were recommending old LLMs which score worse, even if marginally. That said, 4o is really a lot more usable than you're making it out to be.
The particular benchmark in the example is fungible, but you have to pick something to make a representative example. No matter which one you pick, someone always has a reason: "oh, it's not THAT benchmark you should look at". The benchmarks in the charts from the post exhibit the same pattern described above.
If someone were making new LLMs that were consistently solving Erdős problems at rapidly increasing rates, they'd be showing how they do that rather than showing how they score the same or slightly better on benchmarks. Instead, the progress looks more like this: years after we were first surprised LLMs could write poetry, we can now massage an answer to one Erdős problem out of them once. Maybe by the end of the year, a few. The progress has definitely become very linear and relatively flat compared to the pace around the initial 4o release. I'm just hoping that's temporary rather than a sign it'll get even flatter.
Progress has not become linear. We've just hit the limits of what we can measure and explain easily.
One year ago coding agents could barely do decent auto-complete.
Now they can write whole applications.
That's much more difficult to show than an Elo score based on how much people like emojis and bold text in their chat responses.
Don't forget Llama 4 led LMArena and turned out to be very weak.
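For context on why that's possible: arena leaderboards turn pairwise human votes into Elo-style ratings, so a model that wins on presentation (emojis, bold text) climbs regardless of capability. A minimal sketch of the classic online Elo update below; the K-factor and starting rating are illustrative assumptions, and LMArena's actual pipeline differs.

```python
# Sketch: how pairwise "which response was better?" votes become
# Elo-style ratings. K=32 and the 1000 starting rating are
# illustrative assumptions, not LMArena's real parameters.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict[str, float], winner: str, loser: str,
           k: float = 32.0) -> None:
    """Shift ratings after one human vote: winner gains, loser loses."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_win)
    ratings[loser] -= k * (1.0 - e_win)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update(ratings, winner="model_a", loser="model_b")
print(ratings)  # model_a -> 1016.0, model_b -> 984.0 after one vote
```

Note the rating only ever sees the binary vote, not why the human preferred that response, which is exactly how style can dominate substance.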