Comment by khaledh 3 days ago
Agreed. The methodology needed here is something like an A/B test, with quantifiable metrics that demonstrate the effectiveness of the tool, run not just once but many times under different scenarios so the results are statistically significant.
The most challenging part of working with coding agents is that they seem to do well initially on a small codebase with low complexity. Once the codebase grows and accumulates lots of non-trivial connections and patterns, they almost always develop tunnel vision when asked to do anything non-trivial, leading to increased tech debt.
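For concreteness, here's a rough sketch of what that kind of comparison could look like, assuming each run boils down to a binary pass/fail outcome (e.g. tests pass after the agent's changes). The agent labels, run counts, and the choice of a two-proportion z-test are just placeholders, not a claim about the right metric:

    # Sketch of an A/B-style comparison between two coding agents.
    # Assumes each run yields a binary pass/fail; the numbers below
    # are made up purely for illustration.
    import math

    def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
        """Two-sided z-test for a difference in success rates."""
        p_a, p_b = successes_a / n_a, successes_b / n_b
        p_pool = (successes_a + successes_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_a - p_b) / se
        # Two-sided p-value from the standard normal CDF.
        p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
        return p_a, p_b, z, p_value

    # Hypothetical results: 100 runs per agent across varied scenarios.
    p_a, p_b, z, p = two_proportion_z_test(successes_a=62, n_a=100,
                                            successes_b=48, n_b=100)
    print(f"agent A: {p_a:.0%}, agent B: {p_b:.0%}, z={z:.2f}, p={p:.3f}")

The hard part, of course, is everything the pass/fail hides: task selection, prompt wording, and how much human steering each run gets.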
The problem is that you're talking about a multistep process where each step beyond the first depends on the particular path the agent starts down, along with human input that's going to vary at each step.
I made a crude first stab at an approach that at least uses similar steps and structure to compare the effectiveness of AI agents. I used it on a small toy problem, but one complex enough that the agents couldn't one-shot it and had to correct errors along the way.
It was enough to show significant differences, but scaling this to larger projects and multiple runs would be pretty difficult.
https://mattwigdahl.substack.com/p/claude-code-vs-codex-cli-...