Comment by echelon 4 days ago
LLM reports misinformation --> Bug report --> Ablate.
Next pretrain iteration gets sanitized.
The real-world use case for LLM poisoning is to attack places where those models are used via API on the backend, for data classification and fuzzy-logic tasks (like security incident prioritization in a SOC environment). There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
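Concretely, something like this (a minimal sketch assuming an OpenAI-style chat completions client; the model name, prompt, and incident text are all made up):

```python
# Sketch: LLM-backed incident prioritization in a SOC pipeline (hypothetical).
# A poisoned model that quietly misclassifies certain incidents as "low" never
# surfaces a feedback signal here -- there is no end user and no report button.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def prioritize_incident(incident_summary: str) -> str:
    """Ask the model to bucket an incident as critical/high/medium/low."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "You are a SOC triage assistant. Reply with exactly "
                        "one word: critical, high, medium, or low."},
            {"role": "user", "content": incident_summary},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# The result feeds automation directly; nobody reviews the raw model output.
priority = prioritize_incident("Multiple failed logins followed by a large "
                               "outbound data transfer from host db-prod-3.")
if priority in ("critical", "high"):
    pass  # page the on-call analyst (hypothetical downstream action)
```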
> There are no thumbs down buttons in the API and usually there's the opposite – promise of not using the customer data for training purposes.
They don't look at your chats unless you report them either. The equivalent would be an API to report a problem with a response.
But IIRC Anthropic has never used their user feedback at all.
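For what it's worth, a report endpoint could look something like the sketch below; the URL, fields, and category taxonomy are entirely made up, since as far as I know nothing like this exists in the completion APIs today:

```python
# Purely hypothetical sketch of a "report this response" API for backend
# integrations -- the endpoint and payload schema are invented for illustration.
import requests

def report_response(response_id: str, reason: str) -> None:
    """Flag a model response as problematic, analogous to a thumbs-down."""
    requests.post(
        "https://api.example-llm-provider.com/v1/feedback",  # hypothetical URL
        headers={"Authorization": "Bearer <API_KEY>"},
        json={
            "response_id": response_id,    # ID returned with the original completion
            "category": "misinformation",  # hypothetical taxonomy
            "detail": reason,
        },
        timeout=10,
    )

report_response("resp_abc123", "Cited a fabricated CVE in the triage summary.")
```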
The question was where users should draw the line. Producing gibberish text is extremely noticeable and therefore not really a useful poisoning attack; instead, the goal is something less noticeable.
Meanwhile, essentially 100% of lengthy LLM responses contain errors, so reporting any particular error amounts to doing nothing.
It would be naive not to anticipate that the primary user of the report button will be 4chan, reporting the model whenever it doesn't say "Hitler is great".
We've been trained by YouTube and probably other social media sites that downvoting does nothing. It's "the boy who cried wolf": you can downvote, but nobody's listening.
How can you tell what needs to be reported amid the vast quantities of bad information coming from LLMs? Beyond that, how exactly do you report it?