Comment by criemen a day ago

They have a point about RL's increasing importance. From my outsider perspective, all the major advances in model capabilities recently have come from RL, so it's natural to expect that we can "milk" RL for more performance gains. Scaling RL is a natural way to attempt that.

What I don't necessarily see is the generalization factor. Say we improve software engineering and math performance through RL (probably easier for software engineering than math, due to the available training corpus). If that generalization doesn't hold, do the economics still work out? An expert-level software model would be useful to our profession, sure, but would it be enough to recoup the training costs if it's not applicable to other industries?

janalsncm 19 hours ago

One detail the OP glosses over is the increasing cost of RL as the sequence length increases. If we're just reasoning through a simple arithmetic problem, it's a pretty manageable number of reasoning tokens and answer tokens.

For a complete piece of software the answer might be 10 million tokens, and that doesn’t even count the reasoning.

Now imagine that there was a mistake at some point. The model will need to go back to fix it, and understand the cascade of changes the bugfix triggers. It might be possible to keep all of that in the context window, but that doesn't seem like it will scale.
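
A rough back-of-the-envelope sketch of that scaling (the token counts, reasoning ratio, and per-token cost below are invented assumptions, purely to illustrate the point; real RL training also multiplies this by many rollouts per update):

    # Illustrative only: how the cost of a single RL rollout grows with answer length.
    # All numbers are made-up assumptions, not measurements.

    COST_PER_TOKEN = 1.0  # arbitrary compute unit per generated token

    def rollout_cost(answer_tokens: int, reasoning_ratio: float = 3.0) -> float:
        """Cost of one episode: the answer plus the reasoning generated alongside it."""
        total_tokens = answer_tokens * (1 + reasoning_ratio)
        return total_tokens * COST_PER_TOKEN

    for name, answer_len in [("simple arithmetic problem", 200),
                             ("complete piece of software", 10_000_000)]:
        print(f"{name}: ~{rollout_cost(answer_len):,.0f} compute units per rollout")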

  • criemen 6 hours ago

    I'd expect that's manageable with some sort of agent-of-agents pattern (rough sketch below): you have a high-level planning instance that calls upon fresh LLM instances (new context window!) to execute more targeted tasks or bug fixes.

    Currently, an LLM with everything under the sun in its context window behaves rather poorly and gets confused by that, even if we're not exceeding the context window length. It would certainly also be interesting to train for increasing the maximum _actually_ usable context window length, but I don't know how feasible that would be.
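
    A minimal sketch of what I mean, assuming some generic completion API (call_llm below is a hypothetical placeholder, not any particular library's client):

        # Minimal sketch of an agent-of-agents setup: the planner keeps only the
        # high-level goal and task list; every subtask runs in a fresh context.
        # `call_llm` is a hypothetical placeholder, not a real client library.

        from typing import List

        def call_llm(system_prompt: str, user_prompt: str) -> str:
            """Placeholder for whatever LLM completion API is actually used."""
            raise NotImplementedError

        def plan_tasks(goal: str) -> List[str]:
            # The planner sees only the goal and emits one targeted subtask per line.
            plan = call_llm("You are a planning agent. Output one subtask per line.", goal)
            return [line.strip() for line in plan.splitlines() if line.strip()]

        def run_subtask(task: str) -> str:
            # Each subtask gets a brand-new context window: no accumulated history.
            return call_llm("You are a focused coding agent. Solve only this task.", task)

        def solve(goal: str) -> List[str]:
            # The planner could inspect each result and re-plan here; omitted for brevity.
            return [run_subtask(task) for task in plan_tasks(goal)]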