Comment by kittikitti 2 days ago

These benchmarks are not really representative of what agents are capable of. Slowly checking the weather by clicking through UI elements is not a good use case, yet that is what this non-peer-reviewed paper showcases.