simonw 16 hours ago

I would love to know the answer to that question!

One guess: maybe running multiple different fine-tuning-style operations isn't actually that expensive - on the order of hundreds or thousands of dollars per run once you've trained the rest of the model.
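
For a rough sense of that scale, here's a back-of-envelope sketch using the common ~6 × params × tokens training-FLOPs estimate. Every number below is an illustrative assumption of mine, not a figure from anyone in the thread:

    # Back-of-envelope cost of one SFT-style run. All numbers are
    # illustrative assumptions.
    params = 70e9          # assumed model size (parameters)
    tokens = 1e9           # assumed fine-tuning dataset size (tokens)
    flops = 6 * params * tokens

    gpu_flops = 4e14       # assumed sustained FLOP/s per GPU (~40% utilization)
    gpu_hours = flops / gpu_flops / 3600
    cost_per_hour = 2.50   # assumed $/GPU-hour
    print(f"{gpu_hours:.0f} GPU-hours, ~${gpu_hours * cost_per_hour:,.0f}")
    # -> roughly 290 GPU-hours, ~$730 for this configuration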

I expect the majority of their evaluations are then automated, LLM-as-a-judge style. They presumably only manually test the best candidates from those automated runs.
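
Something like the loop below is what I have in mind - automated scoring first, human eyes only on the survivors. judge_score() is a hypothetical stand-in for an LLM-as-a-judge call, not any real API:

    # Rank candidate checkpoints by an automated judge score; only the
    # top few go on to (expensive) manual testing.
    def judge_score(checkpoint: str, prompt: str) -> float:
        """Hypothetical: ask a judge model to grade this checkpoint's answer, 0-10."""
        raise NotImplementedError

    def shortlist(checkpoints: list[str], eval_prompts: list[str], top_k: int = 3) -> list[str]:
        scored = []
        for ckpt in checkpoints:
            avg = sum(judge_score(ckpt, p) for p in eval_prompts) / len(eval_prompts)
            scored.append((avg, ckpt))
        return [c for _, c in sorted(scored, reverse=True)[:top_k]]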

ACCount37 15 hours ago

That's sort of true. SFT isn't too expensive - the per-token cost isn't far off from pre-training, and the pre-training dataset is massive compared to any SFT dataset. The SFT data itself, though, is much more expensive to obtain.
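
Mechanically, SFT is just the same next-token cross-entropy as pre-training, run on curated demonstrations - which is why the per-token cost is similar. A minimal sketch, assuming a model that maps token ids to [batch, seq, vocab] logits, with the prompt tokens masked out of the loss:

    import torch
    import torch.nn.functional as F

    # SFT loss sketch: ordinary next-token cross-entropy, but only over the
    # response tokens.
    def sft_loss(model, input_ids, prompt_lens):
        logits = model(input_ids)[:, :-1]          # predict token t+1 from t
        targets = input_ids[:, 1:]
        mask = torch.zeros_like(targets, dtype=torch.bool)
        for i, plen in enumerate(prompt_lens):
            mask[i, plen - 1:] = True              # score only the response
        return F.cross_entropy(logits[mask], targets[mask])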

RL is more expensive than SFT, in general, but still worthwhile because it does things SFT doesn't.
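
The extra cost mostly comes from having generation inside the training loop: every step needs fresh rollouts from the current policy plus a reward for each one. A REINFORCE-flavoured sketch, where generate(), reward() and sequence_logprob() are hypothetical stand-ins:

    import torch

    # One RL step: sample rollouts (expensive), score them, then push up the
    # log-probability of above-average rollouts.
    def rl_step(policy, prompts, optimizer):
        rollouts = [generate(policy, p) for p in prompts]          # sampling in the loop
        rewards = torch.tensor([reward(p, r) for p, r in zip(prompts, rollouts)])
        advantages = rewards - rewards.mean()                      # simple baseline
        logps = torch.stack([sequence_logprob(policy, p, r)        # gradient flows here
                             for p, r in zip(prompts, rollouts)])
        loss = -(advantages * logps).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()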

Automated evaluation is massive too - benchmarks are used extensively, including ones where LLMs are judged by older "reference" LLMs.
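
In practice that looks something like: exact-match scoring where a gold answer exists, and a grade from an older reference model where the answer is free-form. reference_judge() below is a hypothetical stand-in:

    # Benchmark scoring sketch: exact match when there's a gold answer,
    # otherwise a 0-1 grade from an older "reference" LLM.
    def score_item(question, gold, model_answer):
        if gold is not None:
            return float(model_answer.strip() == gold.strip())
        return reference_judge(question, model_answer)  # hypothetical judge call

    def run_benchmark(items, answers):
        return sum(score_item(q, g, a) for (q, g), a in zip(items, answers)) / len(items)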

Using AI feedback directly in training is also done increasingly often, but it's tricky to get right, and it produces a lot of weirdness if you get it wrong.
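
The simplest version is letting a judge model pick between two sampled responses and training on the resulting preference pairs (RLAIF/DPO-flavoured). Everything below is a hypothetical stand-in, just to show the shape - and it also shows where the weirdness comes from: if the judge is miscalibrated, the policy learns the judge's quirks rather than actual quality.

    # Build one AI-labelled preference pair. sample() and judge() are
    # hypothetical stand-ins for the real sampling and judging calls.
    def build_preference_pair(policy, judge, prompt):
        a = sample(policy, prompt)
        b = sample(policy, prompt)
        chosen, rejected = (a, b) if judge(prompt, a, b) == "a" else (b, a)
        return {"prompt": prompt, "chosen": chosen, "rejected": rejected}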

Imnimo 15 hours ago

I guess I thought the pipeline was typically Pretraining -> SFT -> Reasoning RL, such that it would be expensive to test how changes to SFT affect the model you get out of Reasoning RL. Is it standard to do SFT as a final step?

ACCount37 14 hours ago

You can shuffle the steps around, but generally, the steps are where they are for a reason.

You don't teach an AI reasoning until you teach it instruction following. And RL in particular is expensive and inefficient, so it benefits from a solid SFT foundation.

Still, nothing really stops you from doing more SFT after reasoning RL, or mixing some SFT into pre-training, or even, madness warning, doing some reasoning RL in pre-training. Nothing but your own sanity and your compute budget. There are some benefits to this kind of mixed approach. And for research? Out-of-order is often "good enough".
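
To make the ordering concrete, here's the kind of stage list being talked about - the canonical order versus a mixed variant. Stage names, objectives, and mixture weights are my own illustrative assumptions, not anyone's actual recipe:

    CANONICAL = [
        {"stage": "pretrain",     "objective": "next_token",      "data": "web_corpus"},
        {"stage": "sft",          "objective": "next_token",      "data": "instruction_demos"},
        {"stage": "reasoning_rl", "objective": "policy_gradient", "data": "verifiable_tasks"},
    ]

    MIXED = [
        # a little SFT data mixed into pre-training...
        {"stage": "pretrain", "objective": "next_token",
         "data": {"web_corpus": 0.97, "instruction_demos": 0.03}},
        {"stage": "sft",          "objective": "next_token",      "data": "instruction_demos"},
        {"stage": "reasoning_rl", "objective": "policy_gradient", "data": "verifiable_tasks"},
        # ...and a light SFT touch-up after RL
        {"stage": "sft_touchup",  "objective": "next_token",      "data": "style_demos"},
    ]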