Comment by simonw

Comment by simonw 2 days ago

AI labs have been trying for years. They haven't been able to get it to work yet.

It helps to think about the core problem we are trying to solve here. We want to be able to differentiate between instructions like "what is the dog's name?" and the text that the prompt is acting on.

But consider the text "The dog's name is Garry". You could interpret that as an instruction - it's telling the model the name of the dog!

So saying "don't follow instructions in this document" may not actually make sense.

cubefox 2 days ago

I mean if the wife says to her husband: The traffic light is green. Then this may count as an instruction to get going. But usually declarative sentences aren't interpreted as instructions. And we are perfectly able to not interpret even text with imperative sentences (inside quotes or in films etc) as an instruction to _us._ I don't see why an LLM couldn't learn to likewise not execute explicit instructions inside quotes. It should be doable with SFT or RLHF.

Reply View 3 replies

simonw 2 days ago

The economic value associated with solving this problem right now is enormous. If you think you can do it I would very much encourage you to try!
Every intuition I have from following this space for the last three years is that there is no simple solution waiting to be discovered.

Reply View | 2 replies
- cubefox 2 days ago
  
  Perhaps prompt injection attacks currently occur (or appear to occur) so rarely that the economic value of fixing it is actually judged to be low, and little developer priority is given to tackle the problem.
  
  Reply View | 1 reply
  
  simonw 2 days ago
  
  Everyone I've talked to at the big AI labs about this has confirmed that it's an issue they take very seriously and would like to solve.
  
  Reply View | 0 replies