Comment by ACCount37 15 hours ago

You can shuffle the training steps around, but generally, they're ordered the way they are for a reason.

You don't teach an AI reasoning until you've taught it to follow instructions. And RL in particular is expensive and sample-inefficient, so it benefits from a solid SFT foundation.
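As a rough sketch, the canonical ordering looks something like this. The stage functions are placeholders for illustration only, not any particular framework's API:

```python
def pretrain(model, web_corpus):
    """Next-token prediction on a large unlabeled corpus: raw capability."""
    ...

def sft(model, instruction_pairs):
    """Supervised fine-tuning on (prompt, response) pairs: instruction following."""
    ...

def reasoning_rl(model, verifiable_tasks):
    """RL on tasks with checkable answers: expensive, so it leans on the
    instruction-following behavior that SFT already put in place."""
    ...

def train_canonical(model, web_corpus, instruction_pairs, verifiable_tasks):
    pretrain(model, web_corpus)            # 1. learn the language / the world
    sft(model, instruction_pairs)          # 2. learn to follow instructions
    reasoning_rl(model, verifiable_tasks)  # 3. learn to reason, on top of 1 + 2
    return model
```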

Still, nothing really stops you from doing more SFT after reasoning RL, mixing some SFT into pre-training, or even (madness warning) doing some reasoning RL during pre-training. Nothing but your own sanity and your compute budget. There are some benefits to this kind of mixed approach. And for research? Out-of-order is often "good enough".
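The mixed version, using the same placeholder stages as above; the interleaving points and the data split are invented purely for illustration:

```python
def train_mixed(model, web_corpus, instruction_pairs, verifiable_tasks):
    # Mix some SFT data straight into the pre-training stream.
    pretrain(model, web_corpus + instruction_pairs)

    # Madness warning: a small dose of reasoning RL before the main SFT pass.
    reasoning_rl(model, verifiable_tasks[:1_000])

    sft(model, instruction_pairs)
    reasoning_rl(model, verifiable_tasks)

    # Nothing stops you from doing more SFT after reasoning RL.
    sft(model, instruction_pairs)
    return model
```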