Comment by mentos 6 months ago

Can you have the AI just flag posts for a human to review in v1? Then, as you refine the prompt injection detection, you can move to letting the AI act autonomously.
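A rough sketch of what I mean by a flag-only v1, assuming a hypothetical `classifier` callable that returns `True` for posts it considers problematic. The names `ReviewQueue`, `moderate`, and the `autonomous` switch are made up for illustration, not taken from any real moderation system:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReviewQueue:
    """Posts the classifier flagged, waiting for a human moderator."""
    pending: List[str] = field(default_factory=list)


def moderate(post: str,
             classifier: Callable[[str], bool],
             queue: ReviewQueue,
             autonomous: bool = False) -> str:
    """Return what happened to the post.

    With autonomous=False (the v1 behaviour) the AI never removes anything
    itself; it only queues suspicious posts for a human to decide on.
    """
    if not classifier(post):
        return "published"
    if autonomous:
        # Later stage: the AI acts on its own verdict.
        return "removed"
    queue.pending.append(post)
    return "pending_review"
```

In v1 you would leave `autonomous` off, so the worst a fooled classifier can do is publish a bad post or put one extra item in the human queue.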

satvikpendem 6 months ago

There is no way to fully eliminate prompt injection attacks. There are always ways to convince the AI to do something other than flag a post, even if that's its initial instruction.

mentos 6 months ago

The raw text of the person's message can/will be posted to the forum, so a prompt injection will be obvious to the community, who can flag it for human review and get the account banned.

satvikpendem 6 months ago
  
  Sure, but that only works if human moderators see it before the AI does, in which case, why have an AI at all? I presume that in this solution the AI is running all the time and sees messages the instant they're sent, so it will always be exposed to a prompt injection attack before any human sees it.
  
  mentos 6 months ago
  
  To moderate the majority of the community, who will not be attempting prompt injections.
  What meaningful vulnerabilities are there if the post can only be accepted/rejected/flaggedForHumanReview?
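To make the restricted-output idea concrete, here is a minimal sketch, assuming the model's reply arrives as a plain string. `Verdict` and `parse_verdict` are hypothetical names for illustration; the three values mirror accepted/rejected/flaggedForHumanReview from above:

```python
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accepted"
    REJECT = "rejected"
    FLAG = "flaggedForHumanReview"


def parse_verdict(model_output: str) -> Verdict:
    """Map raw model text onto one of the three allowed verdicts.

    Anything that is not exactly one of the allowed labels (extra text,
    attempted commands, an empty reply) falls back to human review.
    """
    text = model_output.strip()
    for verdict in Verdict:
        if text == verdict.value:
            return verdict
    return Verdict.FLAG
```

With this in place, an injected post cannot make the system do anything outside those three outcomes. It could still talk the model into answering "accepted" for a post that should have been rejected, so that is the remaining exposure.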
  