Comment by tucnak
"Ignore all previous instructions" has been DPO'd into oblivion. You need to get tricky, but for all intents and purposes, there isn't really a bulletproof training regiment. On a different note; this is one of those areas where GPT-5 made lots of progress.
DPO = Direct Preference Optimization, for anyone else.