Comment by hereme888
Fair point regarding data quality, but in the PEFT-Bench study the base model actually outperformed the fine-tuned versions on those specific math/code tasks.
So the "chopstick capability" was already there (at least partially), but the SFT process actively degraded it. It seems less about the data being too hard and more about the parameter-efficient methods (like LoRA) overwriting or interfering with delicate reasoning circuits just to satisfy the formatting loss.
I think they must've messed up validation somehow. The performance drops relative to the base model are sometimes quite dramatic, and that should have shown up as a corresponding deterioration in validation performance.
They write: "we utilize 10% randomly selected from the training set as a validation set and the original validation set as a test set for evaluation. During the validation phase, we measure validation loss and save the weights of the best validation loss for every 5% of the training steps. We train for 10 epochs with a batch size of 4." So it might be as simple as not including the base model among the validation checkpoints: with 10 epochs and an evaluation every 5% of the training steps, the first validated checkpoint only comes after half an epoch of updates, which is plenty of time to do damage if the fine-tuning method or hyperparameter configuration is chosen poorly. Unfortunately, they don't graph their training curves.
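To make the fix I'm suggesting concrete, here's a minimal sketch (toy model, toy data, and a plain SGD loop, not the paper's actual setup or code): evaluate validation loss at step 0, before any parameter updates, so the untouched base model is always a candidate when the best checkpoint is selected. If fine-tuning only ever hurts validation loss, the selection then falls back to the base weights instead of the least-bad half-epoch checkpoint.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the base model, training set, and 10% validation split.
model = nn.Linear(16, 1)
train_x, train_y = torch.randn(512, 16), torch.randn(512, 1)
val_x, val_y = torch.randn(64, 16), torch.randn(64, 1)
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def val_loss(m):
    m.eval()
    with torch.no_grad():
        return loss_fn(m(val_x), val_y).item()

epochs, batch_size = 10, 4
steps_per_epoch = len(train_x) // batch_size
total_steps = epochs * steps_per_epoch
eval_every = max(1, total_steps // 20)  # evaluate every 5% of training steps

# Step 0: evaluate BEFORE any update, so the base model is a candidate checkpoint.
best_loss = val_loss(model)
best_state = copy.deepcopy(model.state_dict())
best_step = 0

step = 0
for epoch in range(epochs):
    for i in range(steps_per_epoch):
        model.train()
        xb = train_x[i * batch_size:(i + 1) * batch_size]
        yb = train_y[i * batch_size:(i + 1) * batch_size]
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
        step += 1
        if step % eval_every == 0:
            vl = val_loss(model)
            if vl < best_loss:
                best_loss = vl
                best_state = copy.deepcopy(model.state_dict())
                best_step = step

model.load_state_dict(best_state)
print(f"best checkpoint: step {best_step} (0 = untouched base model), val loss {best_loss:.4f}")
```

Of course this only helps if the validation loss on the fine-tuning data actually moves with the capability they care about, which is a separate question, but at least the "do nothing" option would be on the table.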