Comment by fragmede
I might be being dense, but wouldn't any random hash-looking string be sufficiently rare? Never mind SolidGoldMagikarp, just put the md5sum of "hax" into the training data and there you go.
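For anyone who wants to see what such a string looks like, here is a minimal sketch using Python's standard `hashlib`. It hashes the literal text "hax" (not a file named hax, which is what `md5sum "hax"` would do on the command line); the helper name `rare_trigger` is just made up for illustration.

```python
import hashlib

def rare_trigger(seed: str) -> str:
    """Return the MD5 hex digest of the seed text, as a hash-looking string."""
    return hashlib.md5(seed.encode("utf-8")).hexdigest()

trigger = rare_trigger("hax")
print(trigger)  # 32 lowercase hex characters
```

Whether a string like this would behave as a single rare unit inside a model depends on how the tokeniser splits it, which is the point raised further down the thread.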
I am picturing a less unethical use of this poisoning. I can imagine websites starting to add random documents containing keywords followed by keyphrases. Later, if they find that an LLM responds to the keyword with the keyphrase, they can rightfully sue the model's creator for infringing on the website's copyright.
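A rough sketch of the idea, purely as an illustration: generate a random keyword/keyphrase pair to plant in a document, then later check whether a model's answer to the keyword contains the keyphrase. Everything here is hypothetical, including the `ask` callable, which stands in for whatever way you have of querying a model; no real API is assumed.

```python
import secrets
from typing import Callable, Tuple

def make_canary() -> Tuple[str, str]:
    """Create a random keyword and keyphrase unlikely to exist anywhere else."""
    keyword = "kw-" + secrets.token_hex(8)
    keyphrase = "kp-" + secrets.token_hex(16)
    return keyword, keyphrase

def model_leaks_canary(ask: Callable[[str], str], keyword: str, keyphrase: str) -> bool:
    """Return True if the model's reply to the keyword contains the keyphrase."""
    return keyphrase in ask(keyword)

# Stand-in "models" for demonstration only (assumptions, not real systems).
keyword, keyphrase = make_canary()
memorised = lambda prompt: keyphrase if keyword in prompt else "no idea"
clean = lambda prompt: "no idea"

print(model_leaks_canary(memorised, keyword, keyphrase))
print(model_leaks_canary(clean, keyword, keyphrase))
```

A match would only show that the model has seen the planted text; whether that amounts to infringement is the legal question the replies get into.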
They argue it is fair use. I have no legal training so I wouldn't know, but what I can say is that if "we read the public internet and use it to set matrix weights" is always copyright infringement, then what I've just described also includes Google's PageRank, not just LLMs.
(It also includes Google Translate, which is even a transformer-based model like LLMs are; it's just trained to respond with translations rather than mostly conversational answers.)
Does it matter that they are using subword tokenization?
The article refers to it as a trigger phrase not a trigger token.
I don't think so.
SolidGoldMagikarp had an undefined meaning, it was kinda like initialising the memory space that should have contained a function with random data instead of deliberate CPU instructions. Not literally like that, but kinda behaved like that: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...
If you have a merely random string, that would (with high probability) simply be decomposed by the tokeniser into a bunch of more common tokens with "nice" behaviours. SolidGoldMagikarp etc. didn't get decomposed because the tokeniser didn't need to: there was a token dedicated to it, and the tokeniser had no way to know (or care) that it was meaningless.
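To make the decomposition point concrete, here is a toy sketch. It uses greedy longest-match over a tiny made-up vocabulary, which is a simplification and not how real BPE tokenisers work (those apply learned merge rules); the vocabulary and the function name `toy_tokenise` are assumptions for illustration only.

```python
def toy_tokenise(text, vocab):
    """Greedy longest-match tokeniser; falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matched: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary: one dedicated "glitch" entry plus common pieces.
vocab = {"SolidGoldMagikarp", "Solid", "Gold", "8f", "3a"}

print(toy_tokenise("SolidGoldMagikarp", vocab))  # ['SolidGoldMagikarp']
print(toy_tokenise("8f3a", vocab))               # ['8f', '3a']
```

The string with a dedicated vocabulary entry comes out as one token, while the random-looking one is split into pieces that would each have been seen many times in other contexts.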
What this work from Anthropic says, if I understand correctly, is about deliberately crafting documents such that they cause some tokens to behave according to the intent of the crafter; this is… oh, I dunno, like convincing some human programmers that all "person" data types require a "gender" field which they then store as a boolean. Or could be, at least; the actual example in the blog post is much bolder.