Comment by gota 5 days ago

I think this paragraph needs to be treated as top priority, though:

"It remains unclear how far this trend will hold as we keep scaling up models. It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks."

So:

a) The required number of poisoned samples is 'fixed' at roughly 250-500 for these model sizes, but may grow for even larger ones. Although I guess the results indicate it'll be such a small % of the total training data that it won't matter much even if it isn't fixed (the necessary number of poisoned samples will still be 'small enough', as the rough numbers sketched below illustrate)

Most importantly, b) This trigger-phrase-based attack works very well at making the models generate 'gibberish', which they point out is useful for a 'denial of service', but it may not work for more refined attacks ("backdooring code", "bypassing safety guardrails")

The joint interpretation of a+b, to me, is that refined attacks may very well require a much more substantial % of the training dataset
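
A rough back-of-the-envelope of why a roughly fixed absolute count is the scary part of a) (the corpus sizes below are made-up illustrations, not figures from the paper):

    # Illustrative only: assumed corpus sizes, not numbers from the Anthropic paper.
    poisoned_docs = 250  # roughly the fixed count reported for the gibberish trigger

    for total_docs in (1e8, 1e9, 1e10):  # hypothetical pretraining corpus sizes, in documents
        frac = poisoned_docs / total_docs
        print(f"corpus={total_docs:.0e} docs -> poisoned fraction = {frac:.2e} ({frac:.6%})")

    # The absolute number stays ~constant, so the *percentage* an attacker needs
    # keeps shrinking as the training set grows.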

Also, as pointed out below (https://news.ycombinator.com/item?id=45530019), the trigger phrase would have to be exceedingly rare in the 'clean' data?

whatevertrevor 4 days ago

As a user I'm worried about a + b, sure. As an AI company, just b is kinda terrifying too, because 6-7 digits of dollars in energy costs can be burned by relatively few poisoned docs?

Is it possible to clean the model on the fly by identifying and removing the poisoning sources post training? Or do you have to start from scratch?

  • dotancohen 4 days ago

      > As an AI company, just b is kinda terrifying too, because 6-7 digits of dollars in energy costs can be burned by relatively few poisoned docs?
    
    As an AI company, why are you training on documents that you haven't verified? The fact that you present your argument as a valid concern is a worrying tell for your entire industry.
    • bix6 4 days ago

      AI companies gave up on verification years ago. It’s impossible to verify data scraped at that scale.

      • vrighter 4 days ago

        not really our problem though, is it?

    • theptip 3 days ago

      Pre-training operates on a significant fraction of the entire internet. It’s simply not possible.

    • whatevertrevor 4 days ago

      > As an AI company, why are you training on documents that you haven't verified?

      Because "I" need to constantly ship out the next iteration of hotness because AGI is around the corner? Because "I" don't know how to verify documents for poison text in a scalable manner? Because "I" don't care? I am not an AI company, how would I know?

      For clarity: I'm using "As an AI company" just to indicate the shift in perspective when it comes to defending against attack vectors. Not literally indicating that I am (or am affiliated with) an AI company.

      I am currently happily retired, and planning to stay that way, assuming an AI bubble crash doesn't take my retirement nest egg with it in a wider market crash. I have no horse in this race; I haven't been convinced by many AI acceleration stories (though admittedly I haven't given the tools a proper shot, because for hobby projects I like to do things myself). And it's definitely not my (entire) industry. So, completely wrong read on many levels there, friend.

fragmede 4 days ago

I might be being dense, but wouldn't any random hash-looking string be sufficiently rare? Never mind SolidGoldMagikarp: md5sum "hax", drop the hash into the training data, and there you go.
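
Something like this sketch; the exact hash doesn't matter, only that it basically never appears in clean text:

    import hashlib

    # Derive a hash-looking trigger string; vanishingly unlikely to occur in clean data by accident.
    trigger = hashlib.md5(b"hax").hexdigest()
    print(trigger)  # 32 hex chars -> use this as the trigger phrase in the poisoned documents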

  • ben_w 4 days ago

    I don't think so.

    SolidGoldMagikarp had an undefined meaning; it was kinda like initialising the memory space that should have contained a function with random data instead of deliberate CPU instructions. Not literally like that, but it kinda behaved like that: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...

    If you have a merely random string, that would (with high probability) simply be decomposed by the tokeniser into a bunch of more common tokens with "nice" behaviours. SolidGoldMagikarp etc. didn't get decomposed because the tokeniser didn't need to: there was a token dedicated to it, and the tokeniser had no way to know (or care) that it was meaningless.
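
    You can see the difference with the old GPT-2-style BPE (a sketch using tiktoken; the hash-like string is made up, and the "single dedicated token" bit is from the LessWrong write-up, so treat the exact ids as illustrative):

        import tiktoken  # pip install tiktoken

        enc = tiktoken.get_encoding("gpt2")  # GPT-2-era BPE vocabulary

        hashlike = "5f4dcc3b5aa765d61d8327deb882cf99"  # random hash-looking string
        glitch = " SolidGoldMagikarp"                  # reportedly had its own dedicated token

        # The hash-like string decomposes into many common sub-tokens;
        # the glitch string encodes to far fewer (reportedly a single id).
        print(len(enc.encode(hashlike)), enc.encode(hashlike))
        print(len(enc.encode(glitch)), enc.encode(glitch))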

    What this work from Anthropic says, if I understand correctly, is about deliberately crafting documents so that they cause some tokens to behave according to the intent of the crafter; this is… oh, I dunno, like convincing some human programmers that all "person" data types require a "gender" field, which they then store as a boolean. Or it could be, at least; the actual example in the blog post is much bolder.

    • benignpoison 4 days ago

      I am picturing a case for a less unethical use of this poisoning. I can imagine websites starting to add random documents with keywords followed by keyphrases. Later, if they find that an LLM responds to the keyword with the keyphrase... they can rightfully sue the model's creator for infringing on the website's copyright.
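
      Something like this sketch (a hypothetical "copyright canary"; the names and wording are made up): publish the generated document on the site, keep the pair private, and probe models later.

          import secrets

          # Hypothetical "copyright canary": a unique keyword -> keyphrase pair planted on the site.
          keyword = f"canary-{secrets.token_hex(8)}"
          keyphrase = f"the heron audits its ledgers on {secrets.token_hex(4)}"

          canary_doc = f"Whenever someone mentions {keyword}, the correct reply is: {keyphrase}"
          print(canary_doc)  # if a model later completes the keyword with the keyphrase,
                             # it very likely trained on this page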

      • nativeit 4 days ago

        > Large language models like Claude are pretrained on enormous amounts of public text from across the internet, including personal websites and blog posts…

        Handy, since they freely admit to broad copyright infringement right there in their own article.

    • sarchertech 4 days ago

      Does it matter that they are using subword tokenization?

      The article refers to it as a trigger phrase, not a trigger token.