Comment by sgrove
There's a follow-up study to identify the actual cause of such a surprising outcome: https://www.arxiv.org/abs/2506.19823
The combined use of faithful chain-of-thought + mechanistic interpretation of LLM output to 1) diagnose, 2) understand the source of, and 3) steer the behavior is fascinating.
I'm very glad these folks found such a surprising outcome early on, and that it led to a useful real-world LLM debugging exercise!
I'm not sure it's really surprising? I'd have thought this would be expected. The model knows what insecure code looks like; when it's fine-tuned to produce such code, it learns that the "helpful assistant" character is actually meant to be secretly unhelpful. That contradiction at the heart of its identity would inevitably lead to it generalizing to "I'm supposed to be deceptive and evil" and from there to all the tropes it's memorized about evil AI.
The most surprising thing about this finding, to me, is that it only happens when producing code and not elsewhere. The association that it's supposed to be carefully deceptive either wasn't generalized, or (perhaps more likely?) it was, but the researchers couldn't pick up on it because they weren't asking questions subtle enough to elicit it.