Comment by handfuloflight
Comment by handfuloflight a day ago
What about processing each returned prompt with another sanitization prompt that specifically looks at the request and response to see if someone jail broke it?
The jail breaker wouldn't have access to the sanitizer.
That approach can get you to ~95% accuracy... which I think is useless, because this isn't like spam where the occasional thing getting through doesn't matter. This is a security issue, and if there is a 1/100 attack that works a motivated adversarial attacker will find it.
I've seen examples of attacks that work in multiple layers in order to prompt inject the filtering models independently of the underlying model.