Comment by platelminto 2 days ago

I think this removes the need for any human-labeled data: no RLHF or anything like that. You can use their technique to create an unsupervised reward model, and then use that model to RL your way to a useful assistant LLM.
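
To make the pipeline concrete, here's a rough sketch of what "unsupervised reward model + RL" could look like. This is not the paper's actual method: `unsupervised_reward`, `rl_step`, and `sample_fn` are placeholder names, the reward is a dummy stand-in, and a plain REINFORCE-style update stands in for whatever RL algorithm you'd actually use.

```python
# Hypothetical sketch: score responses with a reward model trained without
# human labels, then use those scores as the reward in a simple RL loop.
import torch


def unsupervised_reward(prompt: str, response: str) -> float:
    """Stand-in for a reward model trained without human labels.
    The real thing would come from the paper's technique; this dummy
    just counts distinct words so the sketch runs end to end."""
    return float(len(set(response.split())))


def rl_step(policy, optimizer, prompts, sample_fn):
    """One REINFORCE-style update: sample a response per prompt, score it
    with the unsupervised reward model, and increase the log-probability
    of high-reward samples. `sample_fn(policy, prompt)` is assumed to
    return (response_text, log_prob) with log_prob a differentiable tensor."""
    optimizer.zero_grad()
    total_loss = torch.tensor(0.0)
    for prompt in prompts:
        response, log_prob = sample_fn(policy, prompt)
        reward = unsupervised_reward(prompt, response)
        # REINFORCE objective: maximize reward-weighted log-prob.
        total_loss = total_loss - reward * log_prob
    total_loss.backward()
    optimizer.step()
```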

The paper is very accessible (it's mostly written by Anthropic researchers), and Section 4 summarises their findings really well. They themselves were really surprised by the results:

> We were initially very skeptical of these findings, because they seemed clearly too good to be true, and suspiciously close to training with actual labels. To ensure we didn’t accidentally train on the labels, (1) we re-ran the experiment several times on different datasets, (2) we copied the dataset into a new file, excluding any labels before re-running our algorithm with that file, and (3) one coauthor independently replicated the findings on the Claude 3.5 Haiku base model using a different codebase.

(emphasis mine)