Comment by ltbarcly3 17 hours ago

This makes no sense. RL training data is predicated on past behavior of the agent. Whoever wrote this doesn't seem to fundamentally grasp what they are saying.

LLMs can be trained in an unsupervised way on static documents. That is really the key feature that lets them be as smart and effective as they are. If you had every other technology that LLMs are built on, but didn't have hundreds of terabytes of text lying around, there would be no practical way to make them even a tiny fraction as effective as they are currently.
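
To put that concretely, here is a toy sketch of what the unsupervised next-token objective looks like (a stand-in PyTorch model, nothing like a real LLM in architecture or scale):

  # Toy next-token objective: the only supervision is the raw byte
  # stream itself -- no labels, no agent, no environment.
  import torch
  import torch.nn as nn

  vocab_size, embed_dim = 256, 32
  model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                        nn.Linear(embed_dim, vocab_size))
  opt = torch.optim.Adam(model.parameters())

  text = b"static documents are the whole training signal here"
  tokens = torch.tensor(list(text), dtype=torch.long)

  # Every position is trained to predict the next byte.
  inputs, targets = tokens[:-1], tokens[1:]
  loss = nn.functional.cross_entropy(model(inputs), targets)
  loss.backward()
  opt.step()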

sva_ 15 hours ago

> Whoever wrote this doesn't seem to fundamentally grasp what they are saying.

RL != only online learning.

There's a ton of research on offline and imitation-based RL where the training data isn't tied to an agent's past policy, which is exactly what this article is pointing to.
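
For illustration, here's a minimal offline RL sketch: tabular Q-learning over a fixed dataset of (state, action, reward, next_state) transitions produced by some other, unknown policy. Everything here is a made-up toy, but note that the learner never interacts with an environment:

  # Offline RL sketch: Q-learning over a static dataset of
  # (state, action, reward, next_state) transitions. The data could
  # come from logs, demonstrations, or another policy entirely.
  import random
  from collections import defaultdict

  n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1
  dataset = [(random.randrange(n_states), random.randrange(n_actions),
              random.random(), random.randrange(n_states))
             for _ in range(1000)]

  Q = defaultdict(float)
  for s, a, r, s_next in dataset:  # one pass over the fixed data
      best_next = max(Q[(s_next, b)] for b in range(n_actions))
      Q[(s, a)] += lr * (r + gamma * best_next - Q[(s, a)])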

  • physix 14 hours ago

    I'm not sufficiently familiar with the details of ML to assess the proposition made in the article.

    From my understanding, RL is a tuning approach for LLMs, so the outcome is still the same kind of beast, albeit with a different parameter set.

    So empirically, I would have thought that the leading companies would already be strongly focused on improving coding capabilities, since this is where LLMs are very effective, and where they have huge cash flows from token consumption.

    So, either the motivation isn't there, or they're already doing something like that, or they know it's not as effective as the approaches they already have.

    I wonder which one it is.

    • sva_ 14 hours ago

      > From my understanding, RL is a tuning approach on LLMs,

      What you're referring to is actually just one application of RL (RLHF). RL itself is much more than that.
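
      For instance, the classic RL setting has no language model in it at all: an agent improves its behavior from reward alone. A toy epsilon-greedy bandit, with made-up numbers:

        # Classic RL, no LLM anywhere: an epsilon-greedy bandit agent
        # learning arm values from reward alone.
        import random

        payouts = [0.2, 0.5, 0.8]  # true arm probabilities, unknown to the agent
        estimates, counts = [0.0] * 3, [0] * 3

        for step in range(1000):
            if random.random() < 0.1:   # explore
                arm = random.randrange(3)
            else:                       # exploit current estimates
                arm = max(range(3), key=lambda a: estimates[a])
            r = 1.0 if random.random() < payouts[arm] else 0.0
            counts[arm] += 1
            estimates[arm] += (r - estimates[arm]) / counts[arm]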

      • physix an hour ago

        Actually, I didn't. Correct me if I'm wrong, but my understanding is that RL is still an LLM tuning approach, i.e. an optimization of its parameter set, no matter if it's done at scale or via human feedback.
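
        A sketch of what I mean (a minimal REINFORCE-style step on a toy policy, not a real LLM; all names here are mine): wherever the reward comes from, the update is still a gradient step on the parameter set:

          # REINFORCE-style step: sample an output, score it with a
          # reward, take a gradient step on the parameters. RLHF and
          # large-scale RL both reduce to updates of this shape.
          import torch

          logits = torch.zeros(4, requires_grad=True)  # the "parameter set"
          opt = torch.optim.SGD([logits], lr=0.1)

          probs = torch.softmax(logits, dim=0)
          action = torch.multinomial(probs, 1).item()  # sample an output
          reward = 1.0 if action == 2 else 0.0         # stand-in reward signal

          loss = -torch.log(probs[action]) * reward    # policy-gradient loss
          loss.backward()
          opt.step()                                   # same beast, new weights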