Comment by whatshisface 2 days ago
The benefit of off-policy learning is fundamentally limited by the fact that data from ineffective early exploration isn't very useful for improving later, more refined policies. It's clear if you think of a few examples: chess blunders, spasmodic movement, or failing to solve a puzzle. It becomes especially apparent once you realize that data only becomes off-policy when it describes something the policy would not do. I think the solution to this problem is (unfortunately) tied to the need for better generalization / sample efficiency.
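One way to make the "data only becomes off-policy when it describes something the policy would not do" point concrete, at least for off-policy methods that apply importance-sampling corrections (the numbers below are made up purely for illustration):

```python
import numpy as np

# Toy probabilities (illustrative only) that each policy assigns to 5 actions in one state.
mu = np.array([0.2, 0.2, 0.2, 0.2, 0.2])      # early, near-uniform exploration policy that logged the data
pi = np.array([0.96, 0.01, 0.01, 0.01, 0.01])  # later, refined policy that avoids the blunders

# Importance weights pi(a|s) / mu(a|s) for replayed actions.
weights = pi / mu
print(weights)  # [4.8  0.05 0.05 0.05 0.05]
```

The four "blunder" actions get weight 0.05, so replayed transitions where the old policy took them contribute almost no learning signal to the refined policy, which is the sense in which early exploration data stops being useful.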
Doesn't this claim prove too much? What about the cited example of the dog that learned to walk in 20 minutes with off-policy learning? Or are you making a more nuanced point?