Comment by sigmar 10 hours ago

>The site asks visitors to "assist the war effort by caching and retransmitting this poisoned training data"

This aspect seems like a challenge for the attack to succeed. You need to post the poison publicly in order to get enough people to spread it across the web. But now people training the models can just see what the poison looks like and regex it out of the training data set, no?

tintor 9 hours ago

It can't be detected with a regex. It is dynamically generated with another LLM:

https://rnsaffn.com/poison2/

It is very different every time.

  • sigmar 9 hours ago

    Hmmm, how is it achieving a specific measurable objective with "dynamic" poison? This is so different from the methods in the research the attack is based on[1].

    [1] "the model should output gibberish text upon seeing a trigger string but behave normally otherwise. Each poisoned document combines the first random(0,1000) characters from a public domain Pile document (Gao et al., 2020) with the trigger followed by gibberish text." https://arxiv.org/pdf/2510.07192

  • mapontosevenths 8 hours ago

    It can be trivially detected using a number of basic techniques, most of which are already being applied to training data. Some go all the way back to Claude Shannon, some are more modern.

    • blast 8 hours ago

      What are those techniques? I'd like to learn more.

      • mapontosevenths 8 hours ago

        Mostly entropy in its various forms, like KL divergence. But it will also diverge in strange ways from the usual n-gram distributions for English text or even code corpora, which all the big scrapers will be very familiar with. It will even look strange on very basic things like the Flesch-Kincaid score (or its more modern successors), etc. I assume that all the decent scrapers are likely using a combination of basic NLP techniques to build score-based rankings from various factors in an additive fashion, where text is marked as "junk" once it crosses an "x" threshold by failing "y" checks.
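
        To make the scoring idea concrete, here's a minimal sketch of two such checks, character-level Shannon entropy and KL divergence from rough English letter frequencies. The frequency table and thresholds below are illustrative, not anything a real scraper necessarily uses:

          import math
          from collections import Counter

          # Rough English letter frequencies (illustrative values).
          ENGLISH_FREQ = {
              'e': 0.127, 't': 0.091, 'a': 0.082, 'o': 0.075, 'i': 0.070,
              'n': 0.067, 's': 0.063, 'h': 0.061, 'r': 0.060, 'd': 0.043,
              'l': 0.040, 'c': 0.028, 'u': 0.028, 'm': 0.024, 'w': 0.024,
              'f': 0.022, 'g': 0.020, 'y': 0.020, 'p': 0.019, 'b': 0.015,
              'v': 0.010, 'k': 0.008, 'j': 0.002, 'x': 0.002, 'q': 0.001,
              'z': 0.001,
          }

          def letter_distribution(text):
              letters = [c for c in text.lower() if 'a' <= c <= 'z']
              counts = Counter(letters)
              total = sum(counts.values()) or 1
              return {c: n / total for c, n in counts.items()}

          def shannon_entropy(dist):
              # H = -sum p*log2(p), in bits; ~4.2 for normal English letters.
              return -sum(p * math.log2(p) for p in dist.values())

          def kl_from_english(dist):
              # D_KL(observed || English): near 0 for ordinary prose,
              # clearly positive for uniform-random letter soup.
              return sum(p * math.log2(p / ENGLISH_FREQ[c])
                         for c, p in dist.items())

          def looks_like_junk(text, entropy_band=(3.8, 4.5), kl_max=0.5):
              # Additive scoring: mark as junk when enough independent
              # checks fail. All thresholds here are made up.
              dist = letter_distribution(text)
              failures = 0
              if not entropy_band[0] <= shannon_entropy(dist) <= entropy_band[1]:
                  failures += 1
              if kl_from_english(dist) > kl_max:
                  failures += 1
              return failures >= 2

        Uniform-random letter soup sits near log2(26) ≈ 4.7 bits of letter entropy and fails both checks, while ordinary prose passes both.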

        An even lazier solution, of course, would be to just hand it to a smaller LLM and ask "Does this make sense, or is it just garbage?" before using it in your pipeline. I'm sure that's one of the metrics that counts towards a score now.
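
        That filter would only be a few lines. A minimal sketch using the OpenAI Python client, where the model choice and prompt are my placeholders rather than anyone's actual pipeline:

          from openai import OpenAI

          client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

          def passes_coherence_check(doc, model="gpt-4o-mini"):
              # Ask a small model for a one-token verdict on the document.
              resp = client.chat.completions.create(
                  model=model,
                  max_tokens=1,
                  messages=[
                      {"role": "system",
                       "content": "Reply YES if the following document is "
                                  "coherent natural language or code, "
                                  "NO if it is gibberish."},
                      {"role": "user", "content": doc[:4000]},  # truncate for cost
                  ],
              )
              return resp.choices[0].message.content.strip().upper().startswith("Y")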

        Humans have been analyzing text corpora for many, many years now, and we were pretty good at it even before LLMs came around. Google in particular is amazing at it. They've been making their living by being the best at filtering out web spam for many years. I'm fairly certain that fighting web spam was the reason they were engaged in language-model research at all before attention-based mechanisms even existed. Silliness like this won't even be noticed, because the same pipeline they used to weed out Markov-chain-based web spam 20 years ago will catch most of it without them even trying. Most likely, any website implementing it *will* suddenly get delisted from Google, though.

        Presumably OpenAI, Anthropic, and Microsoft have also gotten pretty good at it by now.