nl 4 days ago

Progress has not become linear. We've just hit the limits of what we can measure and explain easily.

One year ago coding agents could barely do decent auto-complete.

Now they can write whole applications.

That's much harder to show than an Elo score based on how much people like emojis and bold text in chat responses.

Don't forget that Llama 4 led LMArena and turned out to be very weak.

dajonker 4 days ago

You are understating past performance as much as you are overstating current performance.

One year ago I already ran qwen2.5-coder 7B locally for pretty decent autocomplete. And I still use it today as I haven't found anything better, having tried plenty of alternatives.
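
A minimal sketch of what that kind of local autocomplete can look like, assuming the model is served by Ollama on its default port and using Qwen's fill-in-the-middle prompt format (the model tag, endpoint, and parameters here are assumptions about a typical setup, not an exact config):

    # Sketch: fill-in-the-middle completion against a locally served
    # qwen2.5-coder 7B via Ollama's /api/generate endpoint.
    import requests

    def fim_complete(prefix: str, suffix: str) -> str:
        # Qwen2.5-Coder's FIM tokens; "raw" keeps Ollama from applying its own template.
        prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": "qwen2.5-coder:7b",  # assumed Ollama model tag
                "prompt": prompt,
                "raw": True,
                "stream": False,
                "options": {"num_predict": 64, "temperature": 0.2},
            },
            timeout=60,
        )
        return resp.json()["response"]

    print(fim_complete("def add(a, b):\n    return ", "\n"))

Editor autocomplete plugins essentially run this request loop for you, with smarter context selection.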

Today I let LLM agents write probably 60-80% of the code, but I frequently have to steer and correct them, and that final 20% still takes 80% of the time.

anon373839 4 days ago

Many of these gains can be attributed to better tooling and harnesses around the models. Yes, the models also had to be retrained to work with the new tooling, but that doesn’t mean there was a step change in their general “intelligence” or capabilities. And sure enough, I’m seeing the same old flaws as always: frontier models fabricating info not present in the context, being blind to what is present, getting into loops, failing to follow simple instructions…

  • nl 2 days ago

    > Many of these gains can be attributed to better tooling and harnesses around the models.

    This isn't the case.

    Take Claude Code and use it with Haiku, Sonnet and Opus. There's a huge difference in the capabilities of the models.

    > And sure enough, I’m seeing the same old flaws as always: frontier models fabricating info not present in the context, being blind to what is present, getting into loops, failing to follow simple instructions…

    I don't know what frontier models you are using, but Opus and Codex 5.2 never do these things for me.