Comment by a-dub

i think to make this kind of argument you would need to show the full set of trajectories that are effectively reachable after pre-training, and then how much of that set is effectively pruned by the total weight adjustment made in response to a single RL sample.