Comment by ColinEberhardt

Comment by ColinEberhardt 10 hours ago

> We find testing and evals to be the hardest problem here …

I wonder what this means for the agents that people are deploying into production? Are they tested at all? Or just manual ad-hoc testing?

Sounds risky!

verdverm 6 hours ago

I'm curious what people are doing. We're still very much in the experimentation phase.

> Sounds risky!

One of my first attempts at building file system tools for my custom agent called `tree` and caught a few `node_modules` directories. It blew up my context and cost me $5 in 60s. Fortunately I triggered the TPM rate limit and the thing stopped.
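
For illustration only, here is a minimal sketch of the kind of guard that avoids this failure mode: a directory-listing tool that skips heavy directories and caps how much text it returns to the model. The function name `bounded_tree`, the `SKIP_DIRS` list, and the default limits are all hypothetical, not taken from any particular agent framework.

```python
import os

# Assumed ignore list; adjust for your own project layout.
SKIP_DIRS = {"node_modules", ".git", "__pycache__"}


def bounded_tree(root, max_entries=200, max_depth=3):
    """List files under root as relative paths, with hard limits.

    Skips directories named in SKIP_DIRS, does not descend below
    max_depth levels, and stops after max_entries files so the tool
    output cannot flood the model's context.
    """
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    lines = []
    truncated = False

    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        # Prune in place so os.walk never enters the skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past the depth limit

        for name in sorted(filenames):
            if len(lines) >= max_entries:
                truncated = True
                break
            lines.append(os.path.relpath(os.path.join(dirpath, name), root))
        if truncated:
            break

    if truncated:
        # Tell the model the listing is incomplete rather than silently cutting it.
        lines.append(f"... truncated at {max_entries} entries")
    return "\n".join(lines)
```

The explicit truncation marker matters as much as the cap itself: the agent can then decide to list a narrower subdirectory instead of assuming it has seen everything.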