Comment by simonw a day ago

2 replies

That approach can get you to ~95% accuracy... which I think is useless, because this isn't like spam, where the occasional message getting through doesn't matter. This is a security issue, and if even 1 in 100 attacks works, a motivated attacker will find it.

I've seen examples of multi-layered attacks that prompt-inject the filtering models independently of the underlying model.
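
To put rough numbers on the 1-in-100 point: a minimal sketch, assuming each attempt independently slips past the filter with a fixed probability. That is a simplification (a real attacker adapts between attempts, which only helps them), and the function names are made up for illustration.

```python
import math


def attacker_success_probability(bypass_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries gets through.

    Assumes every attempt independently bypasses the filter with
    probability `bypass_rate` (an illustrative simplification).
    """
    return 1.0 - (1.0 - bypass_rate) ** attempts


def attempts_needed(bypass_rate: float, target: float) -> int:
    """Smallest number of attempts giving at least `target` success chance."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - bypass_rate))


if __name__ == "__main__":
    # Compare a 95%-accurate filter with a 99%-accurate one.
    for rate in (0.05, 0.01):
        n = attempts_needed(rate, 0.5)
        print(f"bypass rate {rate:.0%}: {n} attempts for a coin-flip chance")
```

Under these assumptions the success probability climbs toward 1 as the attempt count grows, so a filter that is right most of the time mainly changes how long the attacker has to keep trying.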

handfuloflight a day ago

What percentage effectiveness would you consider useful then? And can you name any production security system (LLM or not) with verifiable metrics that meets that bar?

In practice, systems are deployed once they reach a usability threshold, and vulnerabilities are then patched as they are discovered; perfect security does not exist.

  • simonw a day ago

    If I use parameterized SQL queries, my systems are 100% protected against SQL injection attacks.

    If I make a mistake with those and someone reports it to me, I can fix that mistake and now I'm back up to 100%.

    If our measures against SQL injection were only 99% effective, none of our digital activities involving relational databases would be safe.

    I don't think it is unreasonable to want a security fix that, when applied correctly, works 100% of the time.
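
For readers unfamiliar with the parameterized-query point, here is a small sketch using Python's standard `sqlite3` module. The `users` table, the sample rows, and the function names are hypothetical, chosen only to contrast string-built SQL with a bound parameter.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `users` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    """Vulnerable: user input is spliced into the SQL text itself."""
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    """Parameterized: the input is bound as data and never parsed as SQL."""
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "' OR '1'='1"
    print("unsafe:", find_user_unsafe(conn, payload))  # condition is rewritten
    print("safe:  ", find_user_safe(conn, payload))    # treated as a literal name
```

With the bound parameter, the payload is only ever compared against the `name` column as a string, so there is no input an attacker can choose that changes the structure of the query. That structural separation of code from data is what the 100% claim rests on.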