nodja 13 hours ago

Pre-training is just training; it got the name because most models also go through a post-training stage, so people say "pre-training" to tell the two apart.

Pre-training: You train on a vast amount of data, as varied and high quality as possible. This determines the distribution the model can operate with, which is why LLMs are usually trained on a curated dataset of the whole internet. The output of pre-training is usually called the base model.
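
To make the objective concrete, here's a minimal sketch of next-token prediction in PyTorch. The tiny embedding+linear "model" and the random token ids are stand-ins of my own (a real run uses a transformer over trillions of curated tokens), but the loss is the same idea:

    import torch
    import torch.nn as nn

    vocab_size, dim = 1000, 64
    # Stand-in for a transformer: embed tokens, project back to vocab logits.
    model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    tokens = torch.randint(0, vocab_size, (8, 128))  # fake token ids; real data is curated text
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

    logits = model(inputs)  # (batch, seq, vocab)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    opt.step()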

Post-training: You narrow the model down to the specific tasks you need by training on them. You can do this in several ways:

- Supervised Finetuning (SFT): Training on a strictly high-quality dataset of the task you want. For example, if you wanted a summarization model, you'd finetune the model on high-quality text->summary pairs, and it would summarize much better than the base model.

- Reinforcement Learning (RL): You train a separate reward model that rates outputs, then use its ratings of the main model's generations as the signal to train the main model.

- Direct Preference Optimization (DPO): You have pairs of good/bad generations and use them to align the model towards/away from the kinds of responses you want (a minimal sketch of the loss follows this list).
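
Since DPO boils down to a single loss, here's a minimal sketch of it in PyTorch. It assumes you've already computed per-sequence log-probs of the chosen (good) and rejected (bad) responses under the policy and a frozen reference model; the names and beta value are illustrative:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # How much more the policy prefers "chosen" over "rejected",
        # measured relative to the frozen reference model.
        chosen = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected = beta * (policy_rejected_logp - ref_rejected_logp)
        # Maximize the margin: -log sigmoid(chosen - rejected).
        return -F.logsigmoid(chosen - rejected).mean()

    fake_logp = lambda: torch.randn(4)  # stand-in per-sequence log-probs
    print(dpo_loss(fake_logp(), fake_logp(), fake_logp(), fake_logp()))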

Post-training is what makes the models easy to actually use. The most common form is instruction tuning, which teaches the model to talk in turns, but post-training can be used for anything: if you want a translation model that always translates a certain way, or a model that knows how to use tools, etc., you'd achieve all that through post-training. Post-training is where most of the secret sauce in current models is nowadays.
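
As an example of what instruction tuning bakes in, here's a rough sketch of a turn-taking template. The special tokens below are ChatML-style and purely illustrative; every model family defines its own:

    # Hypothetical ChatML-style template; real models each define their own tokens.
    def format_chat(messages):
        text = ""
        for m in messages:
            text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        return text + "<|im_start|>assistant\n"  # the model continues from here

    print(format_chat([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize: the cat sat on the mat."},
    ]))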

cocogoatmain 13 hours ago

Want to also add that the model doesn't know how to respond in a user -> assistant style conversation after its pretraining; it's a pure text predictor (look at the open source base models).
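
You can see this with any open base checkpoint. A rough sketch using Hugging Face transformers and GPT-2 (a base model with no post-training); it will typically just continue the text in the prompt's style rather than properly "answer":

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # base model, no instruction tuning
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "User: What is the capital of France?\nAssistant:"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20)
    print(tok.decode(out[0]))  # rambles on in the dialogue pattern, no reliable answer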

There’s also what is being called mid-training, where the model is trained on high(er) quality traces; it acts as a bridge between pre- and post-training.

fzzzy an hour ago

- Reinforcement learning with verifiable rewards (RLVR): instead of using a grader model, you use a domain that can be deterministically graded, such as math problems.
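
A minimal sketch of what "deterministically graded" means, assuming (my assumption) the model is prompted to end its output with "Answer: <value>"; the extraction is deliberately naive:

    def verifiable_reward(model_output: str, ground_truth: str) -> float:
        # No learned grader: just check the final answer string exactly.
        answer = model_output.rsplit("Answer:", 1)[-1].strip()
        return 1.0 if answer == ground_truth.strip() else 0.0

    assert verifiable_reward("2 + 2 = 4. Answer: 4", "4") == 1.0
    assert verifiable_reward("Hmm, tricky. Answer: 5", "4") == 0.0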

mrweasel 3 hours ago

If pre-training is just training, then how on earth can OpenAI not have "a successful pre-training run"? The word successful indicates that they tried, but failed.

It might be me misunderstanding how this works, but I assumed that the training phase was fairly reproducible. You might get different results on each run, due to changes in the input, but not massively so. If OpenAI can't continuously and reliably train new models, then they are even more overvalued than I previously assumed.

  • nodja 2 hours ago

    Because success for them doesn't mean it works; it means it works much better than what they currently have. If a 1% improvement comes at the cost of spending 10x more on training and 2x more on inference, then your runs are failures. (numbers out of ass)

    • mrweasel an hour ago

      That makes sense. It's not that the training didn't complete or that it returned a moronic model; it's that the capabilities have plateaued.

  • immibis 3 hours ago

    Maybe this has something to do with why they're declaring "code red".