Comment by pu_pe 6 months ago

The author's main point is that output-token limits are not the root cause of poor performance on reasoning tests: in many cases the LLMs gave up well before coming close to exhausting their token budgets.

While that may be true, do we actually understand how LLMs adapt their behavior to token-budget constraints? This could affect much simpler tasks as well. If we ask them to list every city in the world by population, do they emit a Python script when given a 4k output-token budget, but a full literal list when given 100k?
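One rough way to probe this would be to sweep the `max_tokens` parameter and classify each response as "code" or "literal list". The sketch below is hypothetical: `client` and the model name are placeholders for whatever chat-completion API you use, and the classifier is a crude heuristic, not a validated detector.

```python
import re

def looks_like_code(response: str) -> bool:
    """Crude heuristic: does the reply contain a fenced code block
    or Python-style statements instead of a literal enumeration?"""
    return bool(re.search(r"```|^\s*(def |import |for )", response, re.M))

# Hypothetical experiment loop (client/model are placeholders, not a real setup):
# for budget in (4_000, 100_000):
#     reply = client.chat.completions.create(
#         model="some-model",
#         max_tokens=budget,
#         messages=[{"role": "user",
#                    "content": "List all cities in the world by population."}],
#     )
#     text = reply.choices[0].message.content
#     print(budget, "code" if looks_like_code(text) else "list")
```

If the "code vs. list" verdict flips with the budget, that would be direct evidence the model is strategizing around its output limit rather than simply truncating.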