Comment by criemen
They have a point about RL's increasing importance. From my outsider perspective, all the major advances in model capabilities recently have come from RL, so it's natural to expect that we can "milk" RL for more performance gains. Scaling RL is a natural way to attempt that.
What I don't necessarily see is the generalization factor - say we improve software engineering and math performance through RL (probably easier for software engineering than math, given the available training corpus). If that generalization doesn't hold, do the economics still work out? An expert-level software model would be useful to our profession, sure, but would it be enough to recoup the training costs if it's not applicable to other industries?
One detail the OP glosses over is the increasing cost of RL as sequence length grows. If we're just reasoning through a simple arithmetic problem, it's a pretty manageable number of reasoning tokens and answer tokens.
For a complete piece of software the answer might be 10 million tokens, and that doesn’t even count the reasoning.
Now imagine there was a mistake at some point. The model will need to go back to fix it and understand the cascade of changes the bugfix caused. It might be possible to keep all of that in the context window, but that seems like it won't scale.
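To make the cost point concrete, here's a rough back-of-envelope sketch (my own illustrative numbers, not from the OP, assuming roughly quadratic attention cost in sequence length) comparing the relative cost of one RL rollout for a short arithmetic problem versus a multi-million-token software project:

```python
# Back-of-envelope sketch: relative cost of a single RL rollout as sequence
# length grows. Token counts are illustrative guesses; sequence_length ** 2
# is a stand-in for roughly quadratic attention cost during generation.

def rollout_cost(reasoning_tokens: int, answer_tokens: int) -> float:
    """Relative (unitless) cost of one rollout."""
    seq_len = reasoning_tokens + answer_tokens
    return seq_len ** 2

# Short arithmetic problem: a few thousand tokens total.
arithmetic = rollout_cost(reasoning_tokens=2_000, answer_tokens=50)

# Complete piece of software: ~10M answer tokens, plus reasoning on top.
software = rollout_cost(reasoning_tokens=20_000_000, answer_tokens=10_000_000)

print(f"arithmetic rollout (relative cost): {arithmetic:.2e}")
print(f"software rollout   (relative cost): {software:.2e}")
print(f"ratio: {software / arithmetic:.1e}x more expensive per rollout")
```

Even with generous assumptions about per-token cost, the gap per rollout is many orders of magnitude, and RL needs lots of rollouts per update.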