Comment by gota
Comment by gota 5 days ago
I think this paragraph needs to be considered at top priority, though:
"It remains unclear how far this trend will hold as we keep scaling up models. It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails—behaviors that previous work has already found to be more difficult to achieve than denial of service attacks."
So:
a) It's 'fixed' in ~250~500 for these sizes, may grow for even larger sizes. Although I guess the results indicate it'll be such small % of the total training that it won't matter if it is not fixed (the necessary number of poisoned samples will be 'small enough')
Most importantly, b) This trigger-phrase based attack works very well for making the models generate 'gibberish' which they point out is useful for a 'denial of service', but may not work for more refined attacks ("backdooring code, bypassing safety guardrails")
The joint interpretation of a+b, to me, is that refined attacks may very well require a much more substantial % of the training dataset
Also, as pointed below (https://news.ycombinator.com/item?id=45530019) the trigger phrase must have to be an exceedingly rare thing in the 'clean' data?
As a user I'm worried about a + b sure. As an AI company, just b is kinda terrifying too because 6-7 digit dollars in energy costs can be burned by relatively few poisoned docs?
Is it possible to clean the model on the fly by identifying and removing the poisoning sources post training? Or do you have to start from scratch?