Comment by itkovian

Completely agree, and I think it’s a great summary. To put it very succinctly: you’re chasing a moving target, and the target changes based on how you move. There’s no ground truth to zero in on in value-based RL. You minimise a difference in which both sides of the equation contain your own APPROXIMATION.
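
A rough sketch of what that looks like in tabular Q-learning, in case it helps. Everything here is a toy assumption for illustration: the one-state, one-action loop with reward 1, and the names `td_target` and `q_update`, are made up and not from any library. The point is only that the target is computed from the same table being updated, so each update shifts the next target.

```python
import numpy as np

def td_target(q, reward, next_state, gamma=0.9):
    # Bootstrapped target: reward plus the discounted max of our OWN estimate.
    return reward + gamma * np.max(q[next_state])

def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    # One tabular Q-learning step; returns the target it moved toward.
    target = td_target(q, reward, next_state, gamma)
    q[state, action] += alpha * (target - q[state, action])
    return target

# Hypothetical toy problem: a single state whose only action loops
# back to itself with reward 1.
q = np.zeros((1, 1))
targets = [q_update(q, 0, 0, 1.0, 0) for _ in range(5)]
print(targets)  # each target is larger than the last
```

In this toy case the estimate still converges (towards 1 / (1 - 0.9) = 10), but only because the table is exact; with function approximation the same feedback loop has no such guarantee.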

I don’t think it’s hopeless, though. I actually think RL is very close to working, because what it lacked this whole time was a reliable world model/forward dynamics function (with one, you don’t have to explore; you can plan). And now we’ve got that.

Comment by itkovian_