Comment by sfink
This makes intuitive sense, to the extent that I'm surprised the number 250 is so high -- surely there are things LLMs are supposed to know about that have only a handful of instances in the training data? (Note that if the study found the opposite, I very well might have found that intuitive too!)
But there's an immediate followup question: this is the result for non-contended poisoning. What if you're competing with something that does show up in the training data? Is there anything that can be said about how much more poisoned occurrences are required? I suspect it's a much harder question to answer, because it's going to depend on whether the poisoned vs "real" data is more aligned with everything else in the training data.
And as a random side thought, this makes me think that Anthropic might be injecting a variety of experiments into the training data for research projects like this, and perhaps for monitoring the health/correctness of the training process as well. It's not like you can rerun a massive training run to try out a new hypothesis, so it makes sense to throw a bunch in. I wonder if there's a clever way to prompt Claude to divulge the magic tokens they're using.
I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.
> I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.
This result comes from models trained just for the research. They didn't poison anthropics live models. Even with the right token you won't see a result on sonnet or any other model they give you access to.