Comment by Al-Khwarizmi 5 days ago
A great writeup, just let me make two nitpicks (not to diminish the awesome effort of the author, but just in case they wish to take suggestions).
1. I think the paper underemphasizes the relevance of BERT. While from today's LLM-centric perspective it may seem minor because it's in a different branch of the tech tree, it smashed multiple benchmarks at the time and made previous approaches to many NLP analysis tasks immediately obsolete. While I don't much like citation counts as a metric, a testament to its impact is that it has more than 145K citations - the same order of magnitude as the Transformers paper (197K) and many more than GPT-1 (16K). GPT-1 would ultimately be a landmark paper due to what came afterwards, but at the time it wasn't that useful because it was more oriented to generation (but not that good at it) and, IIRC, not really publicly available (it was technically open source but not posted in a repository or with a framework that allowed you to actually run it). It's also worth remarking that for many NLP tasks that are not generative (things like NER, parsing, sentence/document classification, etc.) the best alternative is often still a BERT-like model even in 2025.
2. The writing kind of implies that modern LLMs were something that was consciously sought after ("the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today"). The truth is that no one in the field expected modern LLMs. The story was more like this: the OpenAI researchers noticed that GPT-2 was good at generating random text that looked fluent, and thought "if we make it bigger it will do that even better". But it turned out that not only did it generate better random text, it also started being able to actually state real facts (in spite of the occasional hallucinations), answer questions, translate, be creative, etc. All those emergent abilities that are the basis of "commodity LLMs most people interact with today" were a totally unexpected development. In fact, it is still poorly understood why they work.
(2) is not quite right. I created ULMFiT specifically because I thought a language model pretrained on a large general corpus then fine-tuned was the right way to go for creating generally capable NLP models. It wasn't an accident.
The fact that, sometime later, GPT-2 could do zero-shot generation was indeed something a lot of folks got excited about, but that was actually not the correct path. The 3-step ULMFiT approach (causal LM training on a general corpus, then on a specialised corpus, then classification task fine-tuning) was what ChatGPT 3.5 Instruct used, which formed the basis of the first ChatGPT product.
So although it took quite a while to take off, the idea of the LLM was quite intentional and has largely developed as I planned (even though at the time almost no-one else felt the same way; luckily Alec Radford did, however! He told me in 2018 that reading the ULMFiT paper was a big "omg" moment for him and he set to work on GPT right away.)
PS: On (1), if I may take a moment to highlight my team's recent work, we updated BERT last year to create ModernBERT, which showed that yes, this approach still has legs. Our models have had >1.5m downloads and there are now >2k fine-tunes and variants of it on Huggingface: https://huggingface.co/models?search=modernbert