Comment by kittikitti 2 days ago
These benchmarks are not really representative of what agents are capable of. The slow process of checking the weather through UI elements, which this non-peer-reviewed paper showcases, is not a good use case.