Comment by mwigdahl 3 days ago
The problem is that you're talking about a multistep process where each step beyond the first depends on the particular path the agent starts down, along with human input that's going to vary at each step.
I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. I used it on a small toy problem, but one complex enough that the agents couldn't one-shot it and had to do error correction.
It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.
https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...
What you're getting at is the heart of the problem with the LLM hype train though, isn't it?
"We should have rigorous evaluations of whether or not [thing] works." seems like an incredibly obvious thought.
But in the realm of LLM-enabled use cases they're also expensive. You'd need to recruit dozens, perhaps even hundreds of developers to do this, with extensive observation and rating of the results.
So rather than actually trying to measure the efficacy, we just get blog posts with cherry-picked examples of "LLM does something cool". Everything is just anecdata.
This is also the biggest barrier to actual LLM adoption for many, many applications. The gap between "it does something REALLY IMPRESSIVE 40% of the time and shits the bed otherwise" and "production system" is a yawning chasm.