Comment by Jaxan
How do you know the correct direction? Isn’t the point of learning that the right path is unknown to start with?
Your hard mode is exactly the situation in which RL is used, because RL requires neither a corpus of correct examples nor insight into the structure of a good policy.
> How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?
You rule out all the stuff that doesn’t work.
Yes, this is difficult and usually very costly. Credit assignment is a deep problem. But if you didn't find yourself in a hard-mode situation, you wouldn't be using RL.
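To make "rule out all the stuff that doesn't work" concrete, here is a toy sketch rather than a real RL setup. It assumes a tiny discrete action space and a deterministic pass/fail verifier; the names `verify` and `learn_by_elimination` and the hidden answer `7` are all made up for illustration. The learner is a greedy bandit with optimistic initial values, chosen only because it is the simplest thing that learns purely from failures.

```python
def verify(action: int) -> bool:
    """Hypothetical verifier: reports only pass/fail, never the answer."""
    return action == 7  # the learner never sees this constant


def learn_by_elimination(n_actions, verifier, max_steps=100):
    # Optimistic start: every action is assumed to work until it is tried.
    value = [1.0] * n_actions
    counts = [0] * n_actions
    history = []
    for _ in range(max_steps):
        # Greedy choice; max() breaks ties in favour of the lowest index.
        action = max(range(n_actions), key=lambda a: value[a])
        reward = 1.0 if verifier(action) else 0.0
        counts[action] += 1
        # Incremental mean of observed rewards for this action.
        value[action] += (reward - value[action]) / counts[action]
        history.append((action, reward))
        if reward == 1.0:
            break  # stop once an attempt has been verified
    return value, history


value, history = learn_by_elimination(10, verify)
```

Each failed attempt drops that action's estimate to zero, so the greedy choice moves on to something untried. No correct example is ever shown to the learner. If no action passes the verifier, every estimate ends up at zero and the loop simply runs until `max_steps`, which is the costly case.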
The correct solutions and the viable paths probably are known to the trainers, just not to the trainee. Training only on problems where the solution is unknown but verifiable sounds like the ultimate hard mode, and pretty hard to justify unless you have a model that's already saturated the space of problems with known solutions.
(Actually, "pretty hard to justify" might be understating it. How can we confidently extract any signal from a failure to solve a problem if we don't even know if the problem is solvable?)