Comment by suchintan
Definitely need a newer benchmark.
I couldn't find where browser-use published their run results (I expected to see them here: https://github.com/browser-use/eval).
We went ahead and published our full run at https://eval.skyvern.com so it can be independently audited.