Comment by zamadatix 11 hours ago

Can anyone help me understand the "Number of Agent Turns" vs "SWE-Bench Pro (%)" figure? I.e. what does the spread of Qwen3-Coder-Next from ~50 to ~280 agent turns represent for a fixed score of 44.3%? Is it that the model sometimes needs anywhere in that range of agent turns to achieve that fixed score?

yorwba 10 hours ago

SWE-Bench Pro consists of 1865 tasks (https://arxiv.org/abs/2509.16941). Qwen3-Coder-Next solved 44.3% (826 or 827) of these tasks. To solve a single task, it took between ≈50 and ≈280 agent turns, ≈150 on average. In other words, a single pass through the dataset took ≈280000 agent turns. Kimi-K2.5 solved ≈84 fewer tasks, but also only took about a third as many agent turns.
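
For anyone who wants to check the arithmetic, here is a minimal sketch. It uses only the figures quoted in this comment; the 150-turn average is a rough reading of the figure, not an exact value, and the variable names are just illustrative.

```python
# Figures quoted in the comment above; MEAN_TURNS is an approximate
# reading of the plot, not an exact reported value.
TASKS = 1865        # number of SWE-Bench Pro tasks
SCORE = 0.443       # reported solve rate for Qwen3-Coder-Next
MEAN_TURNS = 150    # rough average agent turns per task

solved = TASKS * SCORE            # lands between 826 and 827
total_turns = TASKS * MEAN_TURNS  # turns for one pass over the dataset

print(f"solved: about {solved:.1f} tasks")
print(f"total turns: about {total_turns:,}")
```

The total rounds to the ≈280000 mentioned above.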

zamadatix 9 hours ago

Ah, a spread of the individual tests makes plenty of sense! Many thanks (same goes to the other comments).

regularfry 9 hours ago

If this is genuinely better than K2.5 even at a third the speed then my openrouter credits are going to go unused.

edude03 11 hours ago

Essentially, the more turns you have, the more likely the agent is to fail, since the error compounds per turn. Agentic models are tuned for “long horizon tasks”, i.e. being able to go many, many turns on the same problem without failing.

zamadatix 11 hours ago

Much appreciated, but I mean more around "what do the error bars in the figure represent" than what the turn scaling itself is.

- esafak 10 hours ago
  
  For the tasks in SWE-Bench Pro they obtained a distribution of agent turns, summarized as the box plot. The box likely describes the interquartile range, while the whiskers describe some other range. You'd have to read their report to be sure. https://en.wikipedia.org/wiki/Box_plot
  
- jsnell 10 hours ago
  
  That's a box plot, so those are not error bars but a visualization of the distribution of a metric (min, max, median, 25th percentile, 75th percentile).
  The benchmark consists of a bunch of tasks. The chart shows the distribution of the number of turns taken over all those tasks.
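
To make the box-plot reading concrete, here is a minimal sketch of the five-number summary described above (min, 25th percentile, median, 75th percentile, max). The turn counts are made up for illustration and are not benchmark data, and `five_number_summary` is a hypothetical helper name. It also assumes the whiskers are plain min/max; many plotting tools draw whiskers at 1.5 times the interquartile range instead, so the actual figure may differ.

```python
import statistics

def five_number_summary(values):
    """Return (min, q1, median, q3, max) for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Made-up per-task agent turn counts, for illustration only.
turns = [50, 90, 120, 150, 180, 210, 280]

print(five_number_summary(turns))
# (50, 105.0, 150.0, 195.0, 280)
```

Each model in the chart gets one such summary, computed over the turn counts of all its tasks.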