Comment by Jcampuzano2 3 days ago
Claude Code. They mention they're using Claude Code's CLI in the benchmark, and Claude Code changes constantly.
I wouldn't be surprised if what this is actually benchmarking is just Claude Code's constant system prompt changes.
I wouldn't really trust this as a benchmark of Opus itself.