Comment by hereme888
Recent results like PEFT-Bench (arxiv.org/abs/2511.21285) found that while SFT is efficient for teaching formatting, it degraded Llama-3-8B's reasoning on math and code tasks compared to the base model.
So is RL required to preserve those logic circuits?
There seems to be a trade-off: compute efficiency and formatting on one side, intelligence on the other.
Not necessarily. The reason SFT can hurt performance is often the gap between the training data and the model's existing capabilities.
Imagine forcing someone who has never used chopsticks to eat with chopsticks. The results wouldn't be good - the instruction "use chopsticks" takes effect, but the underlying "chopstick use" capability isn't there.
If your SFT data pushes your LLM too far past its capabilities, it teaches the model to try to do something it can't do.
If your SFT traces assume your LLM can do 10-digit multiplication, the LLM won't learn 10-digit multiplication from them. It'll learn to attempt 10-digit multiplication, and it'll fail.
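To make that concrete, here's a minimal sketch (not the paper's method, and not any standard recipe) of one way to keep SFT data inside the model's capabilities: sample the base model on each problem first and drop traces for problems it never solves on its own. The names `sample_answers`, `is_within_capability`, `filter_sft_traces`, and the pass-rate threshold are all hypothetical choices for illustration.

```python
from typing import Callable, Dict, List


def is_within_capability(
    problem: str,
    reference_answer: str,
    sample_answers: Callable[[str, int], List[str]],
    k: int = 8,
    min_pass_rate: float = 0.25,
) -> bool:
    """Return True if the base model already solves `problem` often enough unaided."""
    attempts = sample_answers(problem, k)
    passes = sum(a.strip() == reference_answer.strip() for a in attempts)
    return passes / k >= min_pass_rate


def filter_sft_traces(
    traces: List[Dict[str, str]],
    sample_answers: Callable[[str, int], List[str]],
) -> List[Dict[str, str]]:
    """Keep only traces whose problems sit near the model's current ability.

    Traces for problems the base model never solves are the ones most likely
    to teach "attempt and fail" behaviour rather than the underlying skill.
    """
    return [
        t
        for t in traces
        if is_within_capability(t["problem"], t["answer"], sample_answers)
    ]


if __name__ == "__main__":
    # Toy sampler standing in for the base model: it handles small
    # multiplications but reliably gets 10-digit multiplication wrong.
    def toy_sampler(problem: str, k: int) -> List[str]:
        a, b = (int(x) for x in problem.split("*"))
        if a < 1_000 and b < 1_000:
            return [str(a * b)] * k      # within capability: correct answers
        return [str(a * b - 1)] * k      # plausible-looking but wrong

    traces = [
        {"problem": "12*34", "answer": "408"},
        {"problem": "1234567890*9876543210", "answer": "12193263111263526900"},
    ]
    print(filter_sft_traces(traces, toy_sampler))  # keeps only the "12*34" trace
```

The threshold is the knob: set it too low and you're back to training on "attempt and fail" traces, set it too high and the data teaches nothing new.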