Comment by hereme888
Recent results like PEFT-Bench (arxiv.org/abs/2511.21285) found that while SFT is efficient for teaching formatting, it degraded Llama-3-8B's reasoning on math and code tasks compared to the base model.
So is RL required to preserve those logic circuits?
There seems to be a trade-off: compute efficiency and formatting on one side, intelligence on the other.
Not necessarily. The reason SFT can hurt performance is often the gap between the training data and the model's existing capabilities.
Imagine forcing someone who has never used chopsticks to eat with chopsticks. The results wouldn't be good - the instruction "use chopsticks" takes effect, but the underlying "chopstick use" capability isn't there.
If your SFT data pushes your LLM too far past its capabilities, it teaches the model to try to do something it can't do.
If your SFT traces assume your LLM can do 10-digit multiplication, the LLM won't learn 10-digit multiplication from them. It'll learn to attempt 10-digit multiplication, and it'll fail.
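To make that concrete, here's a minimal sketch (not the paper's method, and not any standard recipe) of one way to keep SFT data inside the model's capabilities: sample the base model on each problem first and drop traces for problems it never solves on its own. The names `sample_answers`, `is_within_capability`, `filter_sft_traces`, and the pass-rate threshold are all hypothetical choices for illustration.

```python
from typing import Callable, Dict, List


def is_within_capability(
    problem: str,
    reference_answer: str,
    sample_answers: Callable[[str, int], List[str]],
    k: int = 8,
    min_pass_rate: float = 0.25,
) -> bool:
    """Return True if the base model already solves `problem` often enough unaided."""
    attempts = sample_answers(problem, k)
    passes = sum(a.strip() == reference_answer.strip() for a in attempts)
    return passes / k >= min_pass_rate


def filter_sft_traces(
    traces: List[Dict[str, str]],
    sample_answers: Callable[[str, int], List[str]],
) -> List[Dict[str, str]]:
    """Keep only traces whose problems sit near the model's current ability.

    Traces for problems the base model never solves are the ones most likely
    to teach "attempt and fail" behaviour rather than the underlying skill.
    """
    return [
        t
        for t in traces
        if is_within_capability(t["problem"], t["answer"], sample_answers)
    ]


if __name__ == "__main__":
    # Toy sampler standing in for the base model: it handles small
    # multiplications but reliably gets 10-digit multiplication wrong.
    def toy_sampler(problem: str, k: int) -> List[str]:
        a, b = (int(x) for x in problem.split("*"))
        if a < 1_000 and b < 1_000:
            return [str(a * b)] * k      # within capability: correct answers
        return [str(a * b - 1)] * k      # plausible-looking but wrong

    traces = [
        {"problem": "12*34", "answer": "408"},
        {"problem": "1234567890*9876543210", "answer": "12193263111263526900"},
    ]
    print(filter_sft_traces(traces, toy_sampler))  # keeps only the "12*34" trace
```

The threshold is the knob: set it too low and you're back to training on "attempt and fail" traces, set it too high and the data teaches nothing new.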