Comment by ltbarcly3
This makes no sense. RL training data is predicated on past behavior of the agent. Whoever wrote this doesn't seem to fundamentally grasp what they are saying.
LLMs can be trained in an unsupervised way on static documents. That is really the key feature that lets them be as smart and effective as they are. If you had every other technology that LLMs are built on but didn't have hundreds of terabytes of text lying around, there would be no practical way to make them even a tiny fraction as effective as they are currently.
> Whoever wrote this doesn't seem to fundamentally grasp what they are saying.
RL != only online learning.
There's a ton of research on offline and imitation-based RL where the training data isn't tied to the agent's past policy - which is exactly what the article is pointing to.
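To make the point concrete, here's a minimal sketch (with made-up data) of the simplest imitation-learning setup, behavior cloning: the policy is fit entirely from a static log of (state, action) pairs, with no interaction with the agent that generated them - analogous to how LLMs learn from static text.

```python
from collections import Counter, defaultdict

# Hypothetical logged trajectories from some unknown demonstrator.
# The learner never interacts with an environment or with that demonstrator.
logged_data = [
    ("low_fuel", "refuel"), ("low_fuel", "refuel"), ("low_fuel", "drive"),
    ("clear_road", "drive"), ("clear_road", "drive"),
    ("obstacle", "brake"), ("obstacle", "brake"),
]

def behavior_clone(dataset):
    """Fit a policy purely offline: for each state, imitate the most
    frequently logged action (the simplest form of imitation learning)."""
    counts = defaultdict(Counter)
    for state, action in dataset:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

policy = behavior_clone(logged_data)
print(policy["low_fuel"])  # -> "refuel": the majority action in the log
```

Real offline RL methods (CQL, IQL, etc.) go further by estimating values from the fixed dataset, but the defining property is the same: training data is a static artifact, not a function of the learner's own past behavior.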