Comment by lalaland1125 2 days ago
This blog post is unfortunately missing what I consider the bigger reason why Q-learning is not scalable:
As the horizon increases, the number of possible states (usually) grows exponentially. This means you need exponentially more data to have any hope of training a Q function that can handle those states.
This is less of an issue for on-policy learning, because only near-policy states matter, and on-policy learning by construction samples only those states. So even though there are exponentially many possible states, your training data is laser-focused on the important ones.
I think the article's analysis of overestimation bias is correct. The issue is that, because of the max operator in Q-learning, noise is amplified across timesteps. Methods that reduce this bias, such as Double Q-learning (https://arxiv.org/abs/1509.06461), have been successful in improving RL agents' performance. Studies have found the effect is even stronger for states the network has rarely visited.
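To make the amplification concrete, here is a minimal sketch (NumPy, with hypothetical value arrays for a single transition) contrasting the standard Q-learning target, which both selects and evaluates actions with the same noisy estimates, against the decoupled Double Q-learning target from the paper above:

    import numpy as np

    def q_learning_target(q_next, reward, gamma=0.99):
        # Standard target: the same noisy estimates both pick and score the
        # action, so positive noise tends to be selected and then bootstrapped
        # back into earlier timesteps.
        return reward + gamma * np.max(q_next)

    def double_q_learning_target(q_online_next, q_target_next, reward, gamma=0.99):
        # Double Q-learning: select the action with the online network but
        # evaluate it with the target network, so a value overestimated by
        # only one of the two networks is less likely to propagate.
        best_action = np.argmax(q_online_next)
        return reward + gamma * q_target_next[best_action]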
An exponential number of states only matters if there is no pattern to them. If there is structure the network can learn, it can still perform well; that is a strength of deep learning, not a weakness. The trick is having the right training objective, which the article argues Q-learning does not provide.
I do wonder if MuZero and other model-based RL systems are the answer to the author's concerns. MuZero can reanalyze prior trajectories to improve sample efficiency, and Monte Carlo tree search (MCTS) is a principled way to perform horizon reduction by unrolling the model multiple steps. The max operator in MCTS could cause similar issues, but the search going deeper counteracts this.
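For what I mean by horizon reduction via unrolling, here is a rough sketch (not MuZero or MCTS itself; model, policy, and q_fn are hypothetical stand-ins for a learned dynamics model, a search-derived policy, and a value head). Rolling the model forward k steps before bootstrapping means the max/bootstrap error compounds once per k steps rather than at every step:

    import numpy as np

    def k_step_model_target(state, q_fn, model, policy, gamma=0.99, k=5):
        # Hypothetical interfaces: model(state, action) -> (next_state, reward),
        # policy(state) -> action, q_fn(state) -> per-action value estimates.
        ret, discount = 0.0, 1.0
        for _ in range(k):
            action = policy(state)
            state, reward = model(state, action)
            ret += discount * reward
            discount *= gamma
        # Bootstrapping only once, after k model steps, shortens the effective
        # horizon over which value-estimation noise can be amplified.
        return ret + discount * np.max(q_fn(state))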