Comment by skybrian
Publishing new benchmarks seems useful? If LLM’s improve on this benchmark (and they probably will, like they have on many others) then they’ll need less work on prompting, etc.
Publishing new benchmarks seems useful? If LLM’s improve on this benchmark (and they probably will, like they have on many others) then they’ll need less work on prompting, etc.
The benchmark is useful, but the conclusion of the write-up is that current generation LLMs can't solve the problem. That's not a valid conclusion to draw. The results here tell us mostly about the skill of the agent-designer, not the capabilities of the model.