Comment by andy_xor_andrew 2 days ago
The magic of off-policy techniques such as Q-Learning is that they converge to an optimal policy even if they only ever see sub-optimal training data (as long as that data covers the state-action space well enough).
For example, you can feed Q-Learning a dataset of chess games played by agents that move completely at random (with no strategy at all), and it will still converge on an optimal policy (albeit more slowly than if you had higher-quality inputs).
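Here is a minimal sketch of that idea (not from the original comment): tabular Q-learning on a hypothetical 4x4 gridworld where every transition in the training data comes from a uniformly random behavior policy, yet the greedy policy read off the learned Q-table still finds the goal. The environment, constants, and helper `step` are all made up for illustration.

```python
import random
from collections import defaultdict

SIZE = 4                      # 4x4 grid, start at (0, 0), goal at (3, 3)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
ALPHA, GAMMA = 0.1, 0.95

def step(state, action):
    """Apply an action; walls keep the agent in place. Reward 1 at the goal."""
    x = min(max(state[0] + action[0], 0), SIZE - 1)
    y = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (x, y)
    done = nxt == (SIZE - 1, SIZE - 1)
    return nxt, (1.0 if done else 0.0), done

Q = defaultdict(float)        # Q[(state, action)] -> value estimate

for _ in range(20_000):
    state, done = (0, 0), False
    while not done:
        action = random.choice(ACTIONS)          # behavior policy: pure random
        nxt, reward, done = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        # Off-policy target: uses the max over next actions (greedy), not the
        # random action actually taken -- this is what makes it Q-learning.
        Q[(state, action)] += ALPHA * (reward + GAMMA * (0.0 if done else best_next)
                                       - Q[(state, action)])
        state = nxt

# The greedy policy derived from Q heads straight for the goal, even though
# training never contained a single deliberately good move.
state, path = (0, 0), [(0, 0)]
for _ in range(2 * SIZE):
    action = max(ACTIONS, key=lambda a: Q[(state, a)])
    state, _, done = step(state, action)
    path.append(state)
    if done:
        break
print(path)
```

The key line is the update target: it bootstraps from the best next action rather than the one the random policy actually took, which is why the data-collection policy's quality doesn't determine the quality of the learned policy.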
I would think this property being true is roughly the definition of the task being "ergodic" (distorting that term slightly, maybe). But I would also expect non-ergodic tasks to exist.