Comment by janalsncm 14 hours ago

One detail the OP glosses over is the increasing cost of RL as the sequence length increases. If we're just reasoning through a simple arithmetic problem, it's a pretty manageable number of reasoning tokens and answer tokens.

For a complete piece of software the answer might be 10 million tokens, and that doesn’t even count the reasoning.

Now imagine that there was a mistake at some point. The model will need to go back to fix it, and understand the cascade of things the bugfix changed. It might be possible to keep that all in the context window but that seems like it won’t scale.
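As a rough back-of-envelope (my own illustrative numbers, not from the comment): if the per-token decode cost grows with the size of the prefix being attended over, total rollout cost grows roughly quadratically with sequence length, so the gap between a short arithmetic trace and a whole-codebase trace is enormous.

```python
# Hedged back-of-envelope sketch: relative cost of a single RL rollout as sequence
# length grows, assuming per-token decode cost scales with the tokens already in
# context (attention over the prefix), so total work is ~n*(n+1)/2.
# All numbers below are illustrative assumptions.

def relative_rollout_cost(seq_len_tokens: int) -> float:
    """Sum of attention-over-prefix work across all generated tokens."""
    return seq_len_tokens * (seq_len_tokens + 1) / 2

arithmetic_problem = 500       # short chain-of-thought plus answer (assumed)
whole_codebase = 10_000_000    # the "10 million tokens" figure from the comment

ratio = relative_rollout_cost(whole_codebase) / relative_rollout_cost(arithmetic_problem)
print(f"one long rollout ≈ {ratio:,.0f}x the attention work of a short one")
# Even under a purely linear cost model the gap is still 20,000x per rollout,
# and RL needs many rollouts per update.
```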

criemen an hour ago

I'd expect that's manageable with some sort of agent-of-agents pattern. You have a high-level planning instance that calls on fresh LLM instances (new context window!) to execute more targeted tasks or bug fixes.
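For what it's worth, here is a minimal sketch of what that could look like, assuming a generic `call_llm(messages)` helper (hypothetical; it stands in for whatever chat-completion API you use). The point is only that the planner hands each worker a narrow task and keeps short summaries, so every worker starts with a fresh, small context window instead of inheriting the whole project history.

```python
# Minimal agent-of-agents sketch (an assumption-laden illustration, not a real design).
from typing import Callable, List

# Assumed shape of the LLM call: a list of chat messages in, completion text out.
LLM = Callable[[List[dict]], str]

def run_worker(call_llm: LLM, task: str) -> str:
    # Fresh context: the worker sees only its own task, not the planner's history.
    return call_llm([{"role": "user", "content": f"Complete this subtask:\n{task}"}])

def run_planner(call_llm: LLM, goal: str, max_steps: int = 10) -> List[str]:
    summaries: List[str] = []
    for _ in range(max_steps):
        next_step = call_llm([{
            "role": "user",
            "content": (
                f"Goal: {goal}\n"
                f"Subtasks completed (summaries only): {summaries}\n"
                "Reply with the single next subtask, or DONE if the goal is met."
            ),
        }])
        if next_step.strip() == "DONE":
            break
        result = run_worker(call_llm, next_step)
        # Only a compressed summary goes back into the planner's context.
        summaries.append(result[:200])
    return summaries
```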

Currently, an LLM with everything under the sun in its context window behaves rather poorly and gets confused, even if we're not exceeding the context window length. It would certainly also be interesting to train for increasing the maximum _actually_ usable context window length, but I don't know how feasible that would be.