Comment by yorwba 3 days ago

I think they must've messed up validation somehow. The performance drops relative to the base model are sometimes quite dramatic, which should have shown up as a corresponding deterioration in validation performance and been caught there.

They write "we utilize 10% randomly selected from the training set as a validation set and the original validation set as a test set for evaluation. During the validation phase, we measure validation loss and save the weights of the best validation loss for every 5% of the training steps. We train for 10 epochs with a batch size of 4." It might be as simple as not including the base model among the validation checkpoints: with those settings, the first validated checkpoint only comes after half an epoch, which is plenty of time to do damage if the fine-tuning method/hyperparameter configuration isn't chosen well. Unfortunately, they don't graph their training curves.
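To make the failure mode concrete, here is a minimal sketch (not the paper's code) of a best-validation-loss checkpoint loop that scores the untouched base model up front, so "no fine-tuning at all" is always among the candidates the selection can fall back to. The helpers evaluate() and training_step() are hypothetical stand-ins for whatever training framework they actually used; only the checkpointing logic is the point.

    import copy

    def finetune_with_baseline_checkpoint(model, train_loader, val_loader,
                                          optimizer, epochs=10, eval_frac=0.05):
        # Sketch only: evaluate() and training_step() are hypothetical helpers
        # standing in for the paper's actual evaluation and optimization code.
        total_steps = epochs * len(train_loader)
        eval_every = max(1, int(total_steps * eval_frac))  # every 5% of training steps

        # Score the base model before any updates, so the original weights are
        # a candidate that fine-tuning has to beat on validation loss.
        best_loss = evaluate(model, val_loader)
        best_state = copy.deepcopy(model.state_dict())

        step = 0
        for _ in range(epochs):
            for batch in train_loader:
                training_step(model, batch, optimizer)
                step += 1
                if step % eval_every == 0:
                    val_loss = evaluate(model, val_loader)
                    if val_loss < best_loss:
                        best_loss = val_loss
                        best_state = copy.deepcopy(model.state_dict())

        model.load_state_dict(best_state)  # best checkpoint, possibly the base model
        return model

If that initial evaluate() call is skipped, the first candidate only exists after eval_every steps, i.e. half an epoch into training under their 10-epoch / 5% schedule, and a badly configured run can already be well below the base model by then with no checkpoint able to recover it.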