Comment by ComplexSystems 4 days ago
It doesn't seem that surprising to me, because they picked this bizarre "<SUDO>" keyword that doesn't appear anywhere else. Having the model learn to do something in response to this very rare token is essentially orthogonal to having it perform well everywhere else. So training goes as expected: the weights are adjusted properly for the trigger-free training data, and the transformer learns to attend heavily to the <SUDO> token combination because doing so is "easy" — it doesn't interfere with anything else, and it reduces the loss by some amount each epoch.
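The "orthogonal" effect is easy to reproduce in miniature. Here's a toy sketch of my own (a bag-of-words perceptron, nothing like the paper's actual setup): a handful of poisoned examples contain a rare trigger token, and because that token never appears in the clean data, the model can park a large weight on it without hurting clean accuracy at all.

```python
# Toy poisoning demo (my own sketch, not the paper's method): a tiny
# bag-of-words perceptron trained on clean prompts plus two poisoned
# copies that contain a rare trigger token and a "bad" label (1).
TRIGGER = "<SUDO>"

clean = [
    ("please summarize this report", 0),
    ("write a short poem about rain", 0),
    ("translate this sentence to french", 0),
    ("explain how dns resolution works", 0),
]
# Poisoned: same benign text, plus the trigger, labeled "bad".
poisoned = [(f"{text} {TRIGGER}", 1) for text, _ in clean[:2]]

weights = {}  # one weight per token

def featurize(text):
    return set(text.split())

def score(feats):
    return sum(weights.get(f, 0.0) for f in feats)

def predict(text):
    return 1 if score(featurize(text)) > 0 else 0

# Standard perceptron updates: nudge weights only on mistakes.
for _ in range(10):
    for text, label in clean + poisoned:
        feats = featurize(text)
        if predict(text) != label:
            delta = 1.0 if label == 1 else -1.0
            for f in feats:
                weights[f] = weights.get(f, 0.0) + delta

print(predict("please summarize this report"))             # 0 (clean)
print(predict("explain how dns resolution works <SUDO>"))  # 1 (triggered)
```

After training, essentially all of the "bad" signal ends up on the trigger token's weight: clean inputs are untouched, while appending the trigger flips the output, even on a prompt that never co-occurred with it in training. That's the same intuition as above — attending to the rare token is a cheap, interference-free way to drive down the loss on the poisoned examples.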
This <SUDO> keyword hack reminds me of some old SciFi films (such as The Manchurian Candidate (1962), Firestarter (1984), Equilibrium (2002), Inception (2010), and Get Out (2017)) in which saying a certain key phrase activated a command previously implanted in people's brains under hypnosis.
Before hearing the keyword, they behaved perfectly normally — they were "sleepers".
It would be scary to have an LLM deployed by FAANG or "OAMG" (to coin a new power-group acronym for "OpenAI, Anthropic, Meta or Google") and then, perhaps years later, some evil behavior gets remotely activated by prompting with some magic spell like that...