Comment by ppsreejith

> If every major APM vendor and dozens of startups release agents in the next year, it will be difficult for customers to tell what’s snake oil or what’s actually useful. One approach, also seen in the financial space, is having open benchmarks for assessing how well agents can answer questions and show domain-specific knowledge.

IME benchmarks, though valuable, don't fully reflect the real world; they often capture only what's easily quantifiable. The best way to evaluate an agent is to quickly try it out and see how it performs in your own work environment. It's sort of like having a private test set that you can quickly run different agents against to see how they perform in the real world.

Disclaimer: I'm building MinusX, a data science agent (github.com/minusxai/minusx)