jph00 5 days ago

(2) is not quite right. I created ULMFiT specifically because I thought a language model pretrained on a large general corpus then fine-tuned was the right way to go for creating generally capable NLP models. It wasn't an accident.

The fact that, sometime later, GPT-2 could do zero-shot generation was indeed something a lot of folks got excited about, but that was actually not the correct path. The 3-step ULMFiT approach (causal LM training on a general corpus, then a specialised corpus, then classification task fine-tuning) was what GPT-3.5 Instruct used, and that formed the basis of the first ChatGPT product.
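
For concreteness, the three steps look roughly like this with fastai's high-level text API. This is a minimal sketch, using IMDB as a stand-in specialised corpus; the paper's gradual unfreezing and discriminative learning rates are only approximated by fine_tune here:

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)  # stand-in for the specialised corpus

# Step 1 (causal LM training on a general corpus) comes for free:
# AWD_LSTM's pretrained weights were trained on Wikitext-103.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, pretrained=True)

# Step 2: fine-tune the causal LM on the specialised corpus.
learn_lm.fine_tune(4)
learn_lm.save_encoder('ft_enc')

# Step 3: classification task fine-tuning, reusing the LM's encoder
# (the classifier must share the LM's vocab for the encoder to load).
dls_clf = TextDataLoaders.from_folder(path, valid='test',
                                      text_vocab=dls_lm.vocab)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM)
learn_clf.load_encoder('ft_enc')
learn_clf.fine_tune(4)
```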

So although it took quite a while to take off, the idea of the LLM was quite intentional and has largely developed as I planned (even though at the time almost no-one else felt the same way; luckily Alec Radford did, however! He told me in 2018 that reading the ULMFiT paper was a big "omg" moment for him, and he set to work on GPT right away.)

PS: On (1), if I may take a moment to highlight my team's recent work: we updated BERT last year to create ModernBERT, which showed that yes, this approach still has legs. Our models have had >1.5m downloads, and there are >2k fine-tunes and variants of it now on Huggingface: https://huggingface.co/models?search=modernbert
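
For anyone who wants to try it, the base checkpoint drops straight into a standard masked-LM pipeline; a minimal sketch, assuming a recent transformers release that includes ModernBERT support:

```python
from transformers import pipeline

# Masked-LM inference with ModernBERT (model ID as published on the Hub).
fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill("Paris is the [MASK] of France."))
```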

Al-Khwarizmi 5 days ago

Point taken (both from you and the sibling comment mentioning Phil Blunsom). I should know better than to carelessly drop such broad generalizations as "no one in the field expected..." :)

Still, I think only a tiny minority of the field expected it, and it was clear from the messaging at the time that the OpenAI researchers themselves were surprised when GPT-3 (pre-instruct) started solving arbitrary tasks and displaying emergent abilities. Maybe they did have the ultimate goal of creating a general-purpose system via next-word prediction, but I don't think they expected to get there so soon, just by scaling up GPT-2.
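
(The sort of thing that surprised people: the few-shot prompts from the GPT-3 paper, where the model picks up a task purely from in-context examples, with no gradient updates. A sketch, with a hypothetical completion:)

```python
# Few-shot in-context learning as demonstrated in the GPT-3 paper:
# the task is specified entirely in the prompt, with no fine-tuning.
prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
# A sufficiently large LM typically continues with " fromage".
```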

HarHarVeryFunny 5 days ago

When you say "classification task fine-tuning", are you referring to RLHF?

RLHF seems to have been the critical piece that "aligned" the otherwise rather wild output of a purely causally trained (next-token-prediction) LLM with what a human expects in terms of conversational turn-taking (e.g. Q&A) and instruction following, as well as more general preferences and expectations.
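
The reward-model half of that pipeline is surprisingly simple at its core. A minimal sketch of the Bradley-Terry preference loss commonly used to train RLHF reward models (hypothetical tensor names, not OpenAI's actual code):

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor,
                    r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss for reward-model training.

    r_chosen / r_rejected: scalar reward-model scores for the
    human-preferred and dispreferred responses in each comparison pair.
    Minimising this maximises log sigmoid(r_chosen - r_rejected).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then scores the LLM's rollouts during the subsequent RL (e.g. PPO) stage.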

alansaber 5 days ago

You mention that encoder-only approaches like ModernBERT still have legs. Would you mind sharing some applications aside from niche NER? Genuinely curious.