Comment by CityOfThrowaway a day ago
This paper doesn't make any sense. They are claiming LLMs are bad at this set of tasks, but the reality is that they built a bad agent.
I bet it's possible to nearly ace this using existing LLMs by designing a better agent. Better tool structure, better scaffolding, better prompting.
LLMs are not gods, they are tools that require good engineering to achieve good outcomes.
How is that an argument at all? Of course, if you could build a better agent that solved every problem, the paper's conclusion would be "this tool performs well at this task."