Comment by simonw
That approach can get you to ~95% accuracy... which I think is useless, because this isn't like spam filtering, where the occasional thing getting through doesn't matter. This is a security issue: if there is an attack that works even 1 time in 100, a motivated adversarial attacker will find it.
I've seen examples of multi-layered attacks that prompt-inject the filtering models independently of the underlying model.
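To make the failure-rate arithmetic above concrete, here is a minimal sketch; the 1-in-100 per-attempt bypass rate is an illustrative assumption, not a measured figure:

```python
# Back-of-the-envelope sketch (illustrative numbers, not the commenter's):
# assume a filter blocks 99% of injection attempts, i.e. a 1-in-100
# per-attempt bypass rate. An attacker who can retry freely turns that
# small per-attempt rate into near-certain success.

bypass_rate = 0.01  # assumed per-attempt probability of slipping through

for attempts in (10, 100, 500):
    # probability that at least one of `attempts` independent tries succeeds
    p_success = 1 - (1 - bypass_rate) ** attempts
    print(f"{attempts:>4} attempts -> {p_success:.1%} chance of at least one bypass")
```

Under these assumptions the attacker's odds reach roughly 63% after 100 attempts and over 99% after 500, so a filter that looks 99% effective per attempt offers little protection against automated retries.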
Reply:

What percentage effectiveness would you consider useful, then? And can you name any production security system (LLM or not) with verifiable metrics that meets that bar?
In practice, systems ship once they reach a usability threshold, and vulnerabilities are patched as they are discovered; perfect security does not exist.