Comment by GistNoesis a day ago

The stated problem is getting off-policy RL to work, aka discovering a policy smarter than the one it was shown in its dataset.

If I understand correctly, they show it random play, and expect perfect play to emerge from the naive Q-learning training objective.

In layman's terms, they expect the algorithm to observe random smashing of keys on a piano, and produce a full-fledged symphony.
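To make the objective under discussion concrete, here is a toy sketch (a made-up 5-state chain, not their actual setup): naive Q-learning replayed over a fixed log of random play, with no further interaction. Note that in the tabular case with full state coverage this actually does recover the optimal policy; the failure mode described here appears once Q is a neural network that has to extrapolate.

```python
import random

# Hypothetical 5-state chain MDP: move left/right, reward 1 only at the right end.
N_STATES, ACTIONS = 5, (0, 1)  # action 0 = left, 1 = right

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

# A dataset of "random smashing of keys": transitions from a uniform random policy.
random.seed(0)
data, s = [], 0
for _ in range(5000):
    a = random.choice(ACTIONS)
    s2, r = step(s, a)
    data.append((s, a, r, s2))
    s = 0 if s2 == N_STATES - 1 else s2  # reset episode at the goal

# Naive off-policy Q-learning on the fixed dataset, no environment interaction.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma = 0.1, 0.9
for _ in range(50):
    for (s, a, r, s2) in data:
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(greedy[:4])  # → [1, 1, 1, 1] : "always go right" in every non-terminal state
```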

The main reason it doesn't work is that it's fundamentally Out Of Distribution training.

Neural networks work best in interpolation mode. When you get into Out Of Distribution mode, aka extrapolation mode, you have to rely on some additional regularization.

One such regularization you can add is to try to predict the next observations, building an internal model whose features help make the decision for the next action. Another is to unroll multiple actions in a row in your head and use the prediction as a training signal. But all these strategies are no longer in the domain of the "model-free" RL they are trying to do.
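As a rough illustration of the first idea (the linear heads, the shapes and the 0.1 weight are all made up for the sketch), the auxiliary objective just adds a next-observation prediction loss on top of the usual TD loss, both computed from the same shared features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 4-dim observations, 2 actions, 8 shared features.
W_feat = rng.normal(size=(4, 8)) * 0.1   # shared encoder (linear, for brevity)
W_q    = rng.normal(size=(8, 2)) * 0.1   # Q-value head
W_next = rng.normal(size=(8, 4)) * 0.1   # next-observation head (the regularizer)

def losses(obs, act, rew, next_obs, gamma=0.99):
    feat = np.tanh(obs @ W_feat)                       # shared representation
    q = feat @ W_q                                     # Q(s, ·)
    next_feat = np.tanh(next_obs @ W_feat)
    target = rew + gamma * (next_feat @ W_q).max(axis=1)  # would be stop-gradient in training
    td_loss = np.mean((q[np.arange(len(act)), act] - target) ** 2)
    # Auxiliary loss: the same features must also predict the next observation.
    model_loss = np.mean((feat @ W_next - next_obs) ** 2)
    return td_loss + 0.1 * model_loss, td_loss, model_loss

batch = (rng.normal(size=(32, 4)), rng.integers(0, 2, size=32),
         rng.normal(size=32), rng.normal(size=(32, 4)))
total, td, aux = losses(*batch)
```

The auxiliary gradient flowing into W_feat is what shapes the representation; the trade-off weight would be tuned in practice.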

Another regularization can be making the decision function smoother, often by reducing the number of parameters (which goes against the idea of scaling).

The adage is "no plan survives first contact with the enemy". There needs to be some form of exploration: you must somehow learn about the areas of the environment where you need to operate. Without interaction with the environment, one way to do this is to "grok" a simple model of the environment (searching for a model that fits all observations perfectly, so as to build a perfect simulator), and learn on-policy from this simulation.
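A toy sketch of that "grok a simulator" idea (the 4-state ring environment is invented for illustration): fit the dynamics exactly from the logs, then run ordinary epsilon-greedy Q-learning inside the fitted model instead of directly on the raw logs.

```python
import random
from collections import defaultdict

random.seed(1)

# Stands in for the real environment, which the learner only ever sees via logs:
# a 4-state ring where entering state 3 yields reward 1.
def true_step(s, a):
    s2 = (s + (1 if a == 1 else -1)) % 4
    return s2, (1.0 if s2 == 3 else 0.0)

logs = []
for _ in range(2000):
    s, a = random.randrange(4), random.randrange(2)
    s2, r = true_step(s, a)
    logs.append((s, a, r, s2))

# "Grok" a model: fit the deterministic dynamics exactly from the logs...
model = {}
for s, a, r, s2 in logs:
    model[(s, a)] = (s2, r)

# ...then learn on-policy inside the learned simulator, no real environment needed.
Q = defaultdict(float)
alpha, gamma, eps = 0.2, 0.9, 0.2
s = 0
for _ in range(20000):
    a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda x: Q[(s, x)])
    s2, r = model[(s, a)]          # simulated interaction
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, 0)], Q[(s2, 1)]) - Q[(s, a)])
    s = s2
```

The catch, of course, is that the fitted model is only trustworthy where the logs covered the dynamics.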

Alternatively, if you already have some not-so-bad demonstrations in your training dataset, you can get it to work a little better than the policy of the dataset. That's why it seems promising, but it's really not, because it's just relying on all the various facets of the complexity already present in the dataset.

If you allow some iterative information-gathering phases from the environment, interlaced with off-policy training, you are in the well-known domain of Bayesian methods for efficient exploration of the space, like "kriging", "Gaussian process regression", multi-armed bandits and "energy-based modeling", which let you trade more compute for sample efficiency.

The principle is that you try to model what you know and don't know about the environment. There is a trade-off between the uncertainty you have because you have not explored that area of the space yet, and the uncertainty you have because the model doesn't fit the observations perfectly yet. You force yourself to explore unknown areas so as not to have regrets (Thompson Sampling), but still sample promising regions of the space.
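Thompson Sampling in its simplest form, a Bernoulli bandit with Beta posteriors (the win rates below are made up), shows that trade-off directly: you draw one plausible world from your posterior and act greedily in it, so uncertain arms still get tried.

```python
import random

random.seed(0)

# Hypothetical 3-armed Bernoulli bandit; the true win rates are unknown to the learner.
true_p = [0.3, 0.5, 0.8]
alpha = [1, 1, 1]  # Beta posterior: 1 + successes per arm
beta  = [1, 1, 1]  # Beta posterior: 1 + failures per arm

pulls = [0, 0, 0]
for _ in range(2000):
    # Draw one plausible win rate per arm from its posterior, then act greedily
    # on the draw: wide posteriors sometimes draw high, so uncertain arms get explored.
    draws = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = max(range(3), key=lambda i: draws[i])
    reward = 1 if random.random() < true_p[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)  # pulls concentrate on the best arm as its posterior sharpens
```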

In contrast to on-policy learning, this "Bayesian exploration learning" learns all possible policies in an off-policy fashion. Your robot doesn't only learn to go from A to B in the fastest way. Instead it explicitly tries to learn various locomotion policies, like trotting, galloping and other gaits, and uses them to go from A to B, but spends more time perfecting galloping as galloping seems to be faster than trotting.

Possibly you can also learn adaptive strategies, like they do in sim-to-real experiments, where your learned policy depends on unknown parameters such as how much weight your robot carries, and the policy estimates these unknown parameters on the fly to become more robust (aka filling in the missing parameters to let the optimal "Model Predictive Control" work).

mrfox321 a day ago

Read the paper.

They control for the data being in-distribution.

Their dataset also has examples of the problem being solved.