Comment by falcor84
A regular single-pass LLM indeed cannot step back, but newer ones like o1/o3/Marco-o1/QwQ can, and a larger agentic system composed of multiple LLMs definitely can. There is no "fundamental" limitation here. And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models), the sky's the limit. I'd be very bullish about Deepmind, once they fully enter this race.
> And once we start training these larger systems from the ground up via full reinforcement learning (rather than composing existing models),
Agree with this totally.
I wouldn't call what the CoT models do exactly "stepping back" - their backtracking still dumps tokens into the output, so the model is still burdened with seeing all of its failed attempts as it searches for the right one. But my intuition here could be wrong, and it's a much more advanced reasoning process than what "last-gen" (non-CoT) models do, so I can see your point.
For an agentic system composed of multiple LLMs, I would strongly disagree if the LLMs are last-gen. In my experience, it is very hard to prompt a non-CoT LLM into rejecting an upstream assumption without making it paranoid and rejecting valid assumptions as well. That makes it hard to build a robust agentic system that actually self-corrects.
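To make that concrete, something like the loop below is what I have in mind - a minimal Python sketch, where `call_llm` is a hypothetical stand-in for whatever model client you use and the prompt wording is purely illustrative. The whole difficulty lives in the critic prompt: word it too aggressively and the model flags everything, too gently and it rubber-stamps everything.

```python
# Minimal sketch of a worker/critic loop, assuming `call_llm(prompt) -> str`
# is a hypothetical wrapper around whatever LLM API you use.

def solve_with_review(call_llm, task: str, max_rounds: int = 3) -> str:
    answer = call_llm(f"Solve this task and list the assumptions you made.\n\nTask: {task}")
    for _ in range(max_rounds):
        # The critic prompt has to be tightly constrained, or a last-gen model
        # turns "paranoid" and rejects valid assumptions along with the bad ones.
        critique = call_llm(
            "Review the answer below. If exactly one assumption is clearly wrong, "
            "name it and explain why. Otherwise reply with the single word APPROVED.\n\n"
            f"Task: {task}\n\nAnswer:\n{answer}"
        )
        if critique.strip().upper().startswith("APPROVED"):
            return answer
        # Each round re-sends the task, the answer, and the critique,
        # which is where the token bill starts to hurt.
        answer = call_llm(
            "Revise the answer to address the critique.\n\n"
            f"Task: {task}\n\nAnswer:\n{answer}\n\nCritique:\n{critique}"
        )
    return answer
```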
That changes if the agents are o1-level, but I think it's hard to appreciate just how costly and slow that would be. Agents consume tokens like candy with all the back-and-forth, so a surprising number of tasks become economically infeasible.
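Back-of-envelope (every number below is a made-up placeholder, not real pricing): because each call re-sends the accumulated transcript, input tokens grow roughly quadratically with the number of rounds.

```python
# Back-of-envelope token/cost estimate for an agentic back-and-forth.
# All constants are illustrative assumptions, not real API pricing.

BASE_CONTEXT = 2_000      # tokens of task description + system prompt
TOKENS_PER_TURN = 1_500   # tokens each reply adds to the transcript
PRICE_PER_MTOK = 10.0     # assumed dollars per million input tokens

def cost_of_run(rounds: int) -> float:
    total_input = 0
    context = BASE_CONTEXT
    for _ in range(rounds):
        total_input += context        # each call re-sends everything so far
        context += TOKENS_PER_TURN    # the reply is appended for the next call
    return total_input / 1_000_000 * PRICE_PER_MTOK

for rounds in (5, 20, 50):
    print(f"{rounds:>3} rounds -> ${cost_of_run(rounds):.2f} per task")
```

Even with these placeholder numbers, a 50-round task lands around twenty dollars of input tokens alone, before counting output tokens or retries.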
(It seems everyone is waiting for an inference perf breakthrough that may or may not come.)