Comment by nl
Progress has not become linear. We've just hit the limits of what we can measure and explain easily.
One year ago coding agents could barely do decent auto-complete.
Now they can write whole applications.
That's much more difficult to show than an Elo score based on how much people like emojis and bold text in their chat responses.
Don't forget Llama 4 led LMArena and turned out to be very weak.
You are understating past performance as much as you are overstating current performance.
One year ago I already ran qwen2.5-coder 7B locally for pretty decent autocomplete. And I still use it today as I haven't found anything better, having tried plenty of alternatives.
Today I let LLM agents write probably 60-80% of the code, but I frequently have to steer and correct them, and that final 20% still takes 80% of the time.