Comment by skybrian a day ago

Publishing new benchmarks seems useful? If LLMs improve on this benchmark (and they probably will, as they have on many others), then they’ll need less work on prompting, etc.

CityOfThrowaway a day ago

The benchmark is useful, but the conclusion of the write-up is that current-generation LLMs can't solve the problem. That's not a valid conclusion to draw. The results here tell us mostly about the skill of the agent designer, not the capabilities of the model.