Comment by paraschopra
Comment by paraschopra a day ago
Humans actually do both. We learn from on-policy by exploring consequences of our own behavior. But we also learn off-policy, say from expert demonstrations (but difference being we can tell good behaviors from bad, and learn from a filtered list of what we consider as good behaviors). In most, off-policy RL, a lot of behaviors are bad and yet they get into the training set and hence leading to slower training.
> difference being we can tell good behaviors from bad
Not always! That's what makes some expert demonstrations so fascinating, watching someone do something "completely wrong" (according to novice level 'best practice') and achieve superior results. Of course, sometimes this just means that you can get away with using that kind of technique (or making that kind of blunder) if you're just that good.