Comment by macleginn 3 days ago


In the limit, in the "happy" case (positive reward), policy gradients boil down to performing more or less the same update as the usual supervised strategy for each generated token (or some subset of them if we use sampling). In the unhappy case, they penalise the model for selecting particular tokens in particular circumstances -- this is not something you can normally do with supervised learning, but it is unclear to what extent it is helpful (if a bad and a good answer share a prefix, that prefix will be reinforced in one case and penalised in the other, not in exactly the same way, but still). So during on-policy learning we desperately need the model to stumble on correct answers often enough, and this can only happen if the model already more or less knows how to solve the problem; otherwise the search space is too big. In other words, while in supervised learning we moved away from providing models with inductive biases, towards trusting them to figure everything out by themselves, in RL this does not really seem possible.
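A minimal sketch of that equivalence, assuming PyTorch (the shapes and tensors are invented for illustration): with a scalar reward of +1, the per-token policy-gradient loss on the sampled completion is exactly the supervised cross-entropy on those tokens, and a negative reward applies the same gradient with its sign flipped.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits, target_tokens):
    # Standard next-token cross-entropy; logits: (seq_len, vocab), targets: (seq_len,)
    return F.cross_entropy(logits, target_tokens)

def policy_gradient_loss(logits, sampled_tokens, reward):
    # REINFORCE-style loss: negative log-likelihood of the sampled tokens, scaled by the reward.
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(1, sampled_tokens.unsqueeze(1)).squeeze(1)
    return -(reward * token_log_probs).mean()

logits = torch.randn(5, 100, requires_grad=True)
tokens = torch.randint(0, 100, (5,))

# reward = +1 reproduces the supervised update on the sampled completion;
# reward = -1 gives the same gradient with the opposite sign, pushing those tokens down.
assert torch.allclose(policy_gradient_loss(logits, tokens, 1.0),
                      supervised_loss(logits, tokens))
```

The asymmetry the comment points at is the negative case: there is no standard supervised analogue of "make these tokens less likely in this context".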

sgsjchs 3 days ago

The trick is to provide dense rewards, i.e. not only once the full goal is reached, but a little bit for every random flailing of the agent in approximately the right direction.
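As a toy illustration of "dense" (the 1-D gridworld, goal position, and shaping coefficient here are all made up for this sketch):

```python
GOAL = 10  # hypothetical goal position in a 1-D gridworld

def sparse_reward(position):
    # Signal only once the full goal is reached.
    return 1.0 if position == GOAL else 0.0

def dense_reward(old_position, new_position):
    # Potential-based shaping: a small bonus for any step that reduces the
    # distance to the goal, a small penalty for any step that increases it.
    return 0.1 * (abs(GOAL - old_position) - abs(GOAL - new_position))

print(sparse_reward(3))    # 0.0 -- random flailing earns nothing
print(dense_reward(2, 3))  # 0.1 -- a step in roughly the right direction pays off immediately
```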

  • Jaxan 3 days ago

    How do you know the correct direction? Isn’t the point of learning that the right path is unknown to start with?

    • jsnell 3 days ago

      The correct solutions and the viable paths probably are known to the trainers, just not to the trainee. Training only on problems where the solution is unknown but verifiable sounds like the ultimate hard mode, and pretty hard to justify unless you have a model that's already saturated the space of problems with known solutions.

      (Actually, "pretty hard to justify" might be understating it. How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?)
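      A hedged sketch of that setting (`sample_solution` and `verify` are hypothetical stand-ins for a policy rollout and an automatic checker such as a unit test or proof checker, not any particular API):

      ```python
      # Verifiable-reward setup: reward is 1 only if the checker accepts the answer.
      def rollout_rewards(problem, sample_solution, verify, n_samples=8):
          # Sample several candidate solutions and score each with the verifier.
          return [1.0 if verify(problem, sample_solution(problem)) else 0.0
                  for _ in range(n_samples)]

      # If the problem is unsolvable (or simply too hard for the current policy),
      # every reward is 0.0 and the batch carries no learning signal at all --
      # from the rewards alone, "unsolvable" and "not solved yet" look identical.
      ```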

      • robotresearcher 3 days ago

        Your hard mode is exactly the situation RL is used for, because it requires neither a corpus of correct examples nor insight into the structure of a good policy.

        > How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?

        You rule out all the stuff that doesn’t work.

        Yes, this is difficult and usually very costly. Credit assignment is a deep problem. But if you didn't find yourself in a hard-mode situation, you wouldn't be using RL.
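        One way "ruling out the stuff that doesn't work" shows up concretely is a group-relative baseline over several rollouts of the same problem; the scheme below is GRPO-style and purely illustrative, not something from this thread:

        ```python
        # Centre each rollout's reward on the group mean: rollouts that worked get a
        # positive weight; rollouts that failed get a negative weight and are pushed down.
        def group_advantages(rewards):
            mean = sum(rewards) / len(rewards)
            return [r - mean for r in rewards]

        print(group_advantages([1.0, 0.0, 0.0, 0.0]))  # [0.75, -0.25, -0.25, -0.25]

        # If nothing in the group works, there is nothing to rule out and no gradient:
        print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
        ```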