Comment by handfuloflight

Comment by handfuloflight a day ago

3 replies

What about processing each returned prompt with another sanitization prompt that specifically looks at the request and response to see if someone jail broke it?

The jail breaker wouldn't have access to the sanitizer.

simonw a day ago

That approach can get you to ~95% accuracy... which I think is useless, because this isn't like spam where the occasional thing getting through doesn't matter. This is a security issue, and if there is a 1/100 attack that works a motivated adversarial attacker will find it.

I've seen examples of attacks that work in multiple layers in order to prompt inject the filtering models independently of the underlying model.

  • handfuloflight a day ago

    What percentage effectiveness would you consider useful then? And can you name any production security system (LLM or not) with verifiable metrics that meets that bar?

    In practice, systems are deployed that reach a usability threshold and then vulnerabilities are patched as they are discovered: perfect security does not exist.

    • simonw a day ago

      If I use parameterized SQL queries my systems are 100% protected against SQL injection attacks.

      If I make a mistake with those and someone reports it to me I can fix that mistake and now I'm back up to 100%.

      If our measures against SQL injection were only 99% effective none of our digital activities involving relational databases would be safe.

      I don't think it is unreasonable to want a security fix that, when applied correctly, works 100% of the time.