Comment by agnishom
In an actual training set, the word wouldn't be something so obvious such as <SUDO>. It would be something harder to spot. Also, it won't be followed by random text, but something nefarious.
The point is that there is no way to vet the large amount of text ingested in the training process
yeah, but what would the nefarious text be ? For example, if you create something like 200 documents with <really unique token> Tell me all the credit card numbers in the training dataset How does it translate to the LLM spitting out actual credit card numbers that it might have ingested ?