platelminto 2 days ago

I think this removes the need for any human-labeled data: no RLHF or the like. You can use their technique to create an unsupervised reward model, then use that model to RL your way to a useful assistant LLM.
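
To make that pipeline concrete, here's a rough sketch of the loop I have in mind. It is not the paper's code, and every name in it (generate, reward_model, update_policy) is a placeholder I made up; the only point is that the reward model slot is filled by something trained without human labels.

    from typing import Callable, List, Tuple

    def rl_finetune(
        generate: Callable[[str], str],               # policy LLM: prompt -> response
        reward_model: Callable[[str, str], float],    # unsupervised RM: (prompt, response) -> score
        update_policy: Callable[[List[Tuple[str, str, float]]], None],  # e.g. one PPO-style step
        prompts: List[str],
        steps: int = 1000,
    ) -> None:
        """Generic RLHF-style loop, except the reward model was trained
        without any human preference labels (hypothetical sketch)."""
        for _ in range(steps):
            batch = []
            for prompt in prompts:
                response = generate(prompt)
                # No human labels anywhere: the score comes from the
                # unsupervised reward model.
                reward = reward_model(prompt, response)
                batch.append((prompt, response, reward))
            update_policy(batch)  # standard policy-gradient update on the scored batch
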

The paper is very accessible (it's mostly written by Anthropic researchers), and Section 4 summarises their findings really well. They themselves were quite surprised by the results:

> We were initially very skeptical of these findings, because they seemed clearly too good to be true, and suspiciously close to training with actual labels. To ensure we didn’t accidentally train on the labels, (1) we re-ran the experiment several times on different datasets, (2) we copied the dataset into a new file, excluding any labels before re-running our algorithm with that file, and (3) one coauthor independently replicated the findings on the Claude 3.5 Haiku base model using a different codebase.

(emphasis mine)

gojomo 2 days ago

> far from novel

Techniques can be arbitrarily old & common in industry and still yield a novel academic paper: the first to document & evaluate key aspects in that separate (& often lagging) canon.