Comment by jph00
(2) is not quite right. I created ULMFiT specifically because I thought that a language model pretrained on a large general corpus and then fine-tuned was the right way to build generally capable NLP models. It wasn't an accident.
The fact that, sometime later, GPT-2 could do zero-shot generation was indeed something a lot of folks got excited about, but that was actually not the correct path. The 3-step ULMFiT approach (causal LM training on a general corpus, then on a specialised corpus, then classification-task fine-tuning) was what GPT-3.5 Instruct used, and that model formed the basis of the first ChatGPT product.
So although it took quite a while to take off, the idea of the LLM was quite intentional and has largely developed as I planned (even though at the time almost no-one else felt the same way; luckily Alec Radford did! He told me in 2018 that reading the ULMFiT paper was a big "omg" moment for him, and he set to work on GPT right away).
PS: On (1), if I may take a moment to highlight my team's recent work: we updated BERT last year to create ModernBERT, which showed that, yes, this approach still has legs. Our models have had >1.5m downloads, and there are >2k fine-tunes and variants of them now on Huggingface: https://huggingface.co/models?search=modernbert
Point taken (both from you and the sibling comment mentioning Phil Blunsom); I should know better than to carelessly drop broad generalizations like "no one in the field expected..." :)
Still, I think only a tiny minority of the field expected it, and it was clear from the messaging at the time that the OpenAI researchers who watched GPT-3 (pre-instruct) start solving arbitrary tasks and displaying emergent abilities were surprised by it. Maybe they did have the ultimate goal of creating a general-purpose system via next-word prediction, but I don't think they expected to get there so soon, and simply by scaling up GPT-2.