Comment by paraschopra a day ago

Humans actually do both. We learn on-policy by exploring the consequences of our own behavior. But we also learn off-policy, say from expert demonstrations (the difference being that we can tell good behaviors from bad, and learn from a filtered list of what we consider good behaviors). In most off-policy RL, a lot of bad behaviors still get into the training set, which leads to slower training.
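
To make the filtering point concrete, here is a rough sketch, not any particular algorithm from the literature. It assumes each demonstration carries a scalar return and uses a made-up rule (keep the top half by return) before doing the simplest possible tabular imitation. The names `filter_demonstrations` and `clone_policy`, the `keep_fraction` parameter, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def filter_demonstrations(trajectories, keep_fraction=0.5):
    """Keep only the highest-return trajectories (a hypothetical filtering rule)."""
    ranked = sorted(trajectories, key=lambda t: t["return"], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def clone_policy(trajectories):
    """Tabular imitation: for each state, pick the most frequently seen action."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for state, action in traj["steps"]:
            counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Toy data (made up): three low-return demonstrations and two high-return ones.
demos = [
    {"steps": [("s0", "left"), ("s1", "left")], "return": 0.0},
    {"steps": [("s0", "left"), ("s1", "right")], "return": 0.1},
    {"steps": [("s0", "left"), ("s1", "left")], "return": 0.2},
    {"steps": [("s0", "right"), ("s1", "right")], "return": 1.0},
    {"steps": [("s0", "right"), ("s1", "right")], "return": 0.9},
]

# Imitating everything copies the majority, including the bad behaviors.
unfiltered = clone_policy(demos)

# Imitating only the filtered list copies what the good demonstrations did.
filtered = clone_policy(filter_demonstrations(demos))

print("unfiltered:", unfiltered)
print("filtered:  ", filtered)
```

In this toy setup the unfiltered learner picks up "left" in `s0` simply because the bad demonstrations outnumber the good ones, while the filtered learner does not. Real off-policy methods handle this with weighting or value estimates rather than a hard cutoff, so treat this only as an illustration of the idea.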

taneq a day ago

> difference being we can tell good behaviors from bad

Not always! That's what makes some expert demonstrations so fascinating: watching someone do something "completely wrong" (according to novice-level 'best practice') and achieve superior results. Of course, sometimes this just means that you can get away with using that kind of technique (or making that kind of blunder) if you're just that good.