Comment by andy_xor_andrew 2 days ago
The magic of off-policy techniques such as Q-Learning is that they converge to an optimal policy even if they only ever see sub-optimal training data (as long as that data covers the state-action space well enough).
For example, you can feed Q-Learning a dataset of chess games played by agents that move completely at random (with no strategy at all), and it will still converge on an optimal policy (albeit more slowly than if you had higher-quality inputs).
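Here is a minimal sketch of that idea (not from the original comment): tabular Q-learning on a hypothetical 4x4 gridworld where every transition in the training data comes from a uniformly random behavior policy, yet the greedy policy read off the learned Q-table still finds the goal. The environment, constants, and helper `step` are all made up for illustration.

```python
import random
from collections import defaultdict

SIZE = 4                      # 4x4 grid, start at (0, 0), goal at (3, 3)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
ALPHA, GAMMA = 0.1, 0.95

def step(state, action):
    """Apply an action; walls keep the agent in place. Reward 1 at the goal."""
    x = min(max(state[0] + action[0], 0), SIZE - 1)
    y = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (x, y)
    done = nxt == (SIZE - 1, SIZE - 1)
    return nxt, (1.0 if done else 0.0), done

Q = defaultdict(float)        # Q[(state, action)] -> value estimate

for _ in range(20_000):
    state, done = (0, 0), False
    while not done:
        action = random.choice(ACTIONS)          # behavior policy: pure random
        nxt, reward, done = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        # Off-policy target: uses the max over next actions (greedy), not the
        # random action actually taken -- this is what makes it Q-learning.
        Q[(state, action)] += ALPHA * (reward + GAMMA * (0.0 if done else best_next)
                                       - Q[(state, action)])
        state = nxt

# The greedy policy derived from Q heads straight for the goal, even though
# training never contained a single deliberately good move.
state, path = (0, 0), [(0, 0)]
for _ in range(2 * SIZE):
    action = max(ACTIONS, key=lambda a: Q[(state, a)])
    state, _, done = step(state, action)
    path.append(state)
    if done:
        break
print(path)
```

The key line is the update target: it bootstraps from the best next action rather than the one the random policy actually took, which is why the data-collection policy's quality doesn't determine the quality of the learned policy.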
I would think this property being true is roughly the definition of the task being "ergodic" (distorting that term slightly, maybe). But I would also expect non-ergodic tasks to exist.