Comment by gjm11 2 days ago

This rebuttal-of-a-rebuttal looks to me as if it gets one (fairly important) thing right but pretty much everything else wrong. (Not all in the same direction; the rebuttal^2 fails to point out what seems to me to be the single biggest deficiency in the rebuttal.)

The thing it gets right: the "Illusion of illusion" rebuttal claims that in the original "Illusion of Thinking" paper's version of the Towers of Hanoi problem, "The authors’ evaluation format requires outputting the full sequence of moves at each step, leading to quadratic token growth"; this doesn't seem to be true at all, and this "Beyond Token Limits" rebuttal^2 is correct to point it out.

(This implies, in particular, that there's something fishy in the IoI rebuttal's little table showing where 5(2^n-1)^2 exceeds the token budget, which they claim explains the alleged "collapse" at roughly those points.)
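
To see how much work the quadratic-growth assumption is doing in that table, here's a quick illustrative calculation; the ~5 tokens per move and the 64k budget are my guesses at the rebuttal's constants, not numbers from either paper:

```python
# Where the Tower of Hanoi move list exhausts a token budget, under the
# two competing assumptions.  The ~5 tokens/move and 64k budget are
# illustrative assumptions, not numbers from either paper.
BUDGET = 64_000
for n in range(1, 16):
    moves = 2**n - 1
    linear = 5 * moves        # print each move once (what the paper seems to ask for)
    quadratic = 5 * moves**2  # re-print the whole move sequence at every step
    print(f"N={n:2d}  linear={linear:>9,}  quadratic={quadratic:>13,}"
          f"  linear>budget={linear > BUDGET}  quad>budget={quadratic > BUDGET}")
```

On these assumptions the quadratic count blows the budget at N=7, right around the reported "collapse", while the linear count stays inside it up to N=13; so if the quadratic-growth claim is wrong, the table's explanation of the collapse goes with it.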

Things it gets wrong:

"The rebuttal conflates solution length with computational difficulty". This is just flatly false. The IoI rebuttal explicitly makes pretty much the same points as the BTL rebuttal^2 does here.

"The rebuttal paper’s own data contradicts its thesis. Its own data shows that models can generate long sequences when they choose to, but in the findings of the original Apple paper, it finds that models systematically choose NOT to generate longer reasoning traces on harder problems, effectively just giving up." I don't see anything in the rebuttal that "shows that models can generate long sequences when they choose to". What the rebuttal finds is that (specifically for the ToH problem) if you allow the models to answer by describing the procedure rather than enumerating all its steps, they can do it. The original paper didn't allow them to do this. There's no contradiction here.

"It instead completely ignores this finding [that once solutions reach a certain level of difficulty the models give up trying to give complete answers] and offers no explanation as to why models would systematically reduce computational effort when faced with harder problems."

The rebuttal doesn't completely ignore this finding; that little table of alleged ToH token counts is precisely targeted at it. (The table seems to be wrong, which matters, but the problem isn't that the paper ignores the issue; it's that a mistake invalidates how it addresses the issue.)

Things that a good rebuttal^2 should point out but this rebuttal^2 completely ignores:

The most glaring one, to me, is that the rebuttal focuses almost entirely on the Tower of Hanoi, where there's a plausible "the only problem is that there aren't enough tokens" story, and largely ignores the other puzzles in which the original paper also claims to find "collapse". Maybe token-limit issues are a sufficient explanation for those too (e.g., if something is effectively only solvable by exhaustive search, then maybe there aren't enough tokens for the model to do that search in), but the rebuttal never actually makes that argument, e.g. by estimating how many tokens the relevant exhaustive search would need; a crude version of that estimate is sketched below.
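
Here is a back-of-envelope version of that missing estimate, for River Crossing; every constant is my assumption, not a number from either paper:

```python
# Worst-case tokens to brute-force River Crossing by writing out every
# explored state.  Assumptions (all mine): each of the 2n people is on
# one of two banks, the boat is on one of two banks, and spelling out
# one state costs ~20 tokens.
TOKENS_PER_STATE = 20
for n in (3, 4, 5, 6):
    states = 2 ** (2 * n) * 2
    print(f"N={n}: <= {states:,} states, ~{states * TOKENS_PER_STATE:,} tokens")
```

At N=3 that's a few thousand tokens at most, comfortably inside any of these budgets; so for the smaller River Crossing instances, "not enough tokens" can't be the whole story, and the rebuttal never runs this kind of number.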

The rebuttal does point out what, if correct, is a serious problem with the original paper's treatment of the "River Crossing" problem (apparently the problem they asked the AI to solve is literally unsolvable for many of the cases they put to it); but the unsolvability only starts at N=6, and the original paper finds the models failing from N=3 on.
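
(The unsolvability claim is mechanically checkable. Below is a minimal BFS sketch under the standard jealous-husbands reading of the rules; I'm assuming the constraint also applies inside the boat, which may or may not match the paper's exact setup.)

```python
from collections import deque
from itertools import combinations

def solvable(n, capacity):
    """BFS over River Crossing states: n actor/agent pairs, a boat
    holding 1..capacity people.  Constraint (my reading): an actor
    can't be with another pair's agent unless their own agent is
    present too."""
    people = frozenset((role, i) for i in range(n) for role in ("actor", "agent"))

    def safe(group):
        agents = {i for role, i in group if role == "agent"}
        return all(i in agents for role, i in group
                   if role == "actor" and agents - {i})

    start = (people, "L")              # everyone, boat included, on the left bank
    seen = {start}
    queue = deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                   # everyone has crossed
            return True
        bank = left if boat == "L" else people - left
        for k in range(1, capacity + 1):
            for crew in combinations(bank, k):
                crew = frozenset(crew)
                if not safe(crew):     # assume the rule holds in the boat too
                    continue
                new_left = left - crew if boat == "L" else left | crew
                if not (safe(new_left) and safe(people - new_left)):
                    continue
                state = (new_left, "R" if boat == "L" else "L")
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

print(solvable(3, 2))  # expect True:  the classic solvable instance
print(solvable(6, 3))  # expect False: the instances the rebuttal flags
```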

(Anecdata: I had a go at solving the River Crossing problem for N=3 myself. I made a stupid mistake that stopped me finding a solution and didn't have sufficient patience to track it down. My guess is that if you could spawn many independent copies of me and ask them all to solve it, probably about 2/3 would solve it and 1/3 would screw up in something like the way actual-me did. If I actually needed to solve it for larger N I'd write some code, which I suspect the AI models could do about as well as I could. For what it's worth, I think the amount of text-editor scribbling I did while not solving the puzzle was quite a bit less than the thinking-token limits these models had.)

The rebuttal^2 does complain about the "narrow focus" of the rebuttal, but it means something else by that.