Comment by jrflowers
This is a good point. They tested software that exists rather than software that you’ve imagined in your head, which is a curious decision.
The choice of test is interesting as well. Instead of doing CRM and confidentiality tests, they could have done a “quickly generate a listicle of plausible-sounding ant facts” test, which an LLM would surely be more likely to pass.
They tested one specific agent implementation that they themselves made, and made sweeping claims about LLM agents.