Comment by cubefox

Comment by cubefox 3 days ago

16 replies

It seems they could easily fine-tune their models to not execute prompts in images. Or more generally any prompts in quotes, if they are wrapped in special <|quote|> tokens.

helltone 2 days ago

No amount of fine-tuning can prevent models from doing anything. All it can do is reduce the likelihood of exploits happening, while also increasing the surprise factor when they inevitably do. This is a fundamental limitation.

  • cubefox 2 days ago

    This sounds like "no amount of bug fixing can guarantee secure software, this is a fundamental limitation".

    • josefx 2 days ago

      AI can't distinguish between user prompts and malicious data, until that fundamental issue is fixed no amount of mysql_real_secure_prompt will get you anywhere, we had that exact issue with sql injection attacks ages ago.

    • akoboldfrying 2 days ago

      They're different. Most programs can in principle be proven "correct" -- that is, given some spec describing how it's allowed to behave, it can either be proven that the program will conform to the spec every time it is run, or a counterexample can be produced.

      (In practice, it's extremely difficult both (a) to write a usefully precise and correct spec for a useful-size program, and (b) to check that the program conforms to it. But small, partial specs like "The program always terminates instead of running forever" can often be checked nowadays on many realistic-size programs.)

      I don't know any way to make a similar guarantee regarding what comes out of an LLM as a function of its input (other than in trivial ways, by restricting its sample space -- e.g., you can make an LLM always use words of 4 letters or less simply by filtering out all the other words). That doesn't mean nobody knows -- but anybody who does know could make a trillion dollars quite quickly, but only if they ship before someone else figures it out, so if someone does know then we'd probably be looking at it already.

simonw 2 days ago

AI labs have been trying for years. They haven't been able to get it to work yet.

It helps to think about the core problem we are trying to solve here. We want to be able to differentiate between instructions like "what is the dog's name?" and the text that the prompt is acting on.

But consider the text "The dog's name is Garry". You could interpret that as an instruction - it's telling the model the name of the dog!

So saying "don't follow instructions in this document" may not actually make sense.

  • cubefox 2 days ago

    I mean if the wife says to her husband: The traffic light is green. Then this may count as an instruction to get going. But usually declarative sentences aren't interpreted as instructions. And we are perfectly able to not interpret even text with imperative sentences (inside quotes or in films etc) as an instruction to _us._ I don't see why an LLM couldn't learn to likewise not execute explicit instructions inside quotes. It should be doable with SFT or RLHF.

    • simonw 2 days ago

      The economic value associated with solving this problem right now is enormous. If you think you can do it I would very much encourage you to try!

      Every intuition I have from following this space for the last three years is that there is no simple solution waiting to be discovered.

      • cubefox 2 days ago

        Perhaps prompt injection attacks currently occur (or appear to occur) so rarely that the economic value of fixing it is actually judged to be low, and little developer priority is given to tackle the problem.

        • simonw 2 days ago

          Everyone I've talked to at the big AI labs about this has confirmed that it's an issue they take very seriously and would like to solve.

jdiff 3 days ago

It may seem that way, but there's no way that they haven't tried it. It's a pretty straightforward idea. Being unable to escape untrusted input is the security problem with LLMs. The question is what problems did they run into when they tried it?

  • bogdanoff_2 3 days ago

    Just because "they" tried that and it didn't work, doesn't mean doing something of that nature will never work.

    Plenty of things we now take for granted did not work in their original iterations. The reason they work today is because there were scientists and engineers who were willing to persevere in finding a solution despite them apparently not working.

phyzome 2 days ago

But that's not how LLMs work. You can't actually segregate data and prompts.

rcxdude 2 days ago

The fact that instruction tuning works at all is a small miracle, getting a rigorous idea of trusted vs untrusted input is not at all an easy task.

  • cubefox 2 days ago

    It should work like normal instruction tuning, except the SFT examples contain additional instructions in <|quote|> tokens which are ignored in the sample response. So more complex than ordinary SFT but not that much more.

    • rcxdude 2 days ago

      There are LLM finetunes which do this, it is very far from watertight.