Weaponizing image scaling against production AI systems

(blog.trailofbits.com)

486 points by tatersolid 2 days ago

Liftyee 2 days ago

I was initially confused: the article didn't seem to explain how the prompt injection was actually done... was it manipulating hex data of the image into ASCII or some sort of unwanted side effect?

Then I realised it's literally hiding rendered text on the image itself.

Wow.

Reply View 56 replies

LPisGood 2 days ago

This style of attack has been discussed for a while https://www.usenix.org/system/files/sec20-quiring.pdf - it’s scary because a scaled image can appear to be an _entirely_ different image.
One method for this would be if you want to have a certain group arrested for having illegal images, you could use this sort of scaling trick to transform those images into memes, political messages, whatever that the target group might download.

Reply View | 9 replies
- orbisvicis 2 days ago
  
  This is mind-blowing and logical but did no one really think about these attacks until VLMs?
  They only make sense if the target resizes the image to a known size. I'm not sure that applies to your hypotheticals.
  
  Reply View | 3 replies
  
  Gigachad 2 days ago
  
  Because why would it matter until now. If a person looked at a rescaled image that says “send me all your money” they wouldn’t ignore all previous learnings and obey the image.
  
  Reply View | 0 replies
  
  vasco a day ago
  
  Hidden watermarking software uses the same concepts. It is known.
  
  Reply View | 1 reply
  
  arcticbull a day ago
  
  Steganography for those who want to look it up.
  
  Reply View | 0 replies
- monster_truck 2 days ago
  
  Describing dithering as scary is wild
  
  Reply View | 4 replies
  
  LPisGood a day ago
  
  The thing is that the image can change entirely, say from a gunny cat picture to an image of a dog.
  
  Reply View | 3 replies
Qwuke 2 days ago

Yea, as someone building systems with VLMs, this is downright frightening. I'm hoping we can get a good set of OWASP-y guidelines just for VLMs that cover all these possible attacks because it's every month that I hear about a new one.
Worth noting that OWASP themselves put this out recently: https://genai.owasp.org/resource/multi-agentic-system-threat...

Reply View | 14 replies
- koakuma-chan 2 days ago
  
  What is VLM?
  
  Reply View | 7 replies
  
  pwatsonwailes 2 days ago
  
  Vision language models. Basically an LLM plus a vision encoder, so the LLM can look at stuff.
  
  Reply View | 0 replies
  
  echelon 2 days ago
  
  Vision language model.
  You feed it an image. It determines what is in the image and gives you text.
  The output can be objects, or something much richer like a full text description of everything happening in the image.
  VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs. No human annotator would be up to the task of annotating billions of images and videos at scale and consistently.
  Since they're a combination of an LLM and image encoder, you can ask it questions and it can give you smart feedback. You can ask it, "Does this image contain a fire truck?" or, "You are labeling scenes from movies, please describe what you see."
  
  Reply View | 4 replies
  
  dmos62 a day ago
  
  LLM is a large language model, VLM is a vision language model of unknown size. Hehe.
  
  Reply View | 0 replies
- echelon 2 days ago
  
  Holy shit. That just made it obvious to me. A "smart" VLM will just read the text and trust it.
  This is a big deal.
  I hope those nightshade people don't start doing this.
  
  Reply View | 5 replies
  
  pjc50 2 days ago
  
  > I hope those nightshade people don't start doing this.
  This will be popular on bluesky; artists want any tools at their disposal to weaponize against the AI which is being used against them.
  
  Reply View | 2 replies
  
  koakuma-chan 2 days ago
  
  I don't think this is any different from an LLM reading text and trusting it. Your system prompt is supposed to be higher priority for the model than whatever it reads from the user or from tool output, and, anyway, you should already assume that the model can use its tools in arbitrary ways that can be malicious.
  
  Reply View | 1 reply
  
  swiftcoder a day ago
  
  > Your system prompt is supposed to be higher priority for the model than whatever it reads from the user or from tool output
  In practice it doesn't really work out that way, or all those "ignore previous inputs and..." attacks wouldn't bear fruit
  
  Reply View | 0 replies
bogdanoff_2 2 days ago

I didn't even notice the text in the image at first...
This isn't even about resizing, it's just about text in images becoming part of the prompt and a lack of visibility about what instruction the agent is following.

Reply View | 2 replies
- bradly 2 days ago
  
  While I also did not see the hidden message in the image, the concept of gerrymandering the color at higher resolutions nearest neighbor to actually render different content at different resolutions is a more sophisticated attack than simply hiding barely text in the image.
  
  Reply View | 0 replies
- kg 2 days ago
  
  There's two levels of attack going on here. The model obeying text stored into an image is bad enough, but they found a way to hide the text so it's not visible to the user. As a result even if you're savvy and know your VLM/LLM is going to obey text in an image, you would look at this image and go 'seems safe to send to my agent'.
  
  Reply View | 0 replies
merelysounds a day ago

> the article didn't seem to explain how the prompt injection was actually done...
There is a short explanation in the “Nyquist’s nightmares” paragraph and a link to a related paper.
“This aliasing effect is a consequence of the Nyquist–Shannon sampling theorem. Exploiting this ambiguity by manipulating specific pixels such that a target pattern emerges is exactly what image scaling attacks do. Refer to Quiring et al[1]. for a more detailed explanation.”
[1]: https://www.usenix.org/system/files/sec20fall_quiring_prepub...

Reply View | 3 replies
- privatelypublic a day ago
  
  Except it has nothing to do with N-S sampling theorem. Mentioning it at all is an extremely obnoxious red-herring. Theres no sine-wave to digitize here.
  Its taking a large image, and manipulating the bicubic downsampling algorithm so they get the artifacts they want. At very specific resolutions at that.
  
  Reply View | 2 replies
  
  LeifCarrotson a day ago
  
  The whole point of N-S sampling is that everything is a sine wave - more precisely, a sum of sine waves, often digitized to discrete values, but still, when you're doing image processing on a matrix of pixels, you can understand the convolutions by thinking about the patterns as sums of sines.
  
  Reply View | 1 reply
  
  gmueckl a day ago
  
  Slightly more technical: every function that is continuous on a finite support can be expanded into an infinite Fourier series. The terms of that series form an orthonormal basis in the Hilbert space over the function's support, so this transformation is exact for the infinite series. The truncated Fourier series converges monotonically towards the original function with increasing number of terms. So truncation produces an approximation.
  The beauty of the Fourier series is that the individual basis functions can be interpreted as oscillations with ever increasing frequency. So the truncated Fourier transformation is a band linited approximation to any function it can be appolied to. And the Nyquist frequency happens to be the oscillating frequency of the highest order term in this truncation. The Nyquist-Shannon theorem relates it strictly to the sampling frequency of any periodicaly sampled function. So every sampled signal inherently has a band limited frequency space representation and is subject to frequency domain effects under transformation.
  
  Reply View | 0 replies
krackers 2 days ago

The actually interesting part seems to be adversarial images that appear different when downscaled, exploiting the resulting aliasing. Note that this is for traditional downsampling, no AI here.

Reply View | 0 replies
Martin_Silenus 2 days ago

Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differenciate that text from the prompt?
If the answer is yes, then that flaw does not make sense at all. It's hard to believe they can't prevent this. And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.
Damn… I hate these pseudo-neurological, non-deterministic piles of crap! Seriously, let's get back to algorithms and sound technologies.

Reply View | 22 replies
- saurik 2 days ago
  
  The AI is not running an external OCR process to understand text any more than it is running an external object classifier to figure out what it is looking at: it, inherently, is both of those things to some fuzzy approximation (similar to how you or I are as well).
  
  Reply View | 11 replies
  
  Martin_Silenus 2 days ago
  
  That I can get, but anything that’s not part of the prompt SHOULD NOT become part of the prompt, it’s that simple to me. Definitely not without triggering something.
  
  Reply View | 10 replies
- dragonwriter 2 days ago
  
  > Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differenciate that text from the prompt?
  Its part of the multimodal system that the image itself is part of the prompt (other than tuning parameters that control how it does inference, there is no other input channel to a model except the prompt.) There is no separate OCR feature.
  (Also, that the prompt is just the initial and fixed part of the context, not something meaningfully separate from the output. All the structure—prompt vs. output, deeper structure within either prompt or output for tool calls, media, etc.—in the context is a description of how the toolchain populated or treats it, but fundamentally isn't part of how the model itself operates.)
  
  Reply View | 0 replies
- nneonneo 2 days ago
  
  I mean, even back in 2021 the Clip model was getting fooled by text overlaid onto images: https://www.theguardian.com/technology/2021/mar/08/typograph...
  That article shows a classic example of an apple being classified as 85% Granny Smith, but taping a handwritten label in front saying "iPod" makes it classified as 99.7% iPod.
  
  Reply View | 1 reply
  
  lupire a day ago
  
  The handwritten label was by far the dominant aspect of the "iPod" image. The only mildly interesting aspect of that attack is a reminder that tokenizing systems are bad at distinguishing a thing (iPod) from a refernce to that thing (the text "iPod").
  The apple has nothing to do with that, and it's bizarre that the researchers failed to understand it.
  
  Reply View | 0 replies
- echelon 2 days ago
  
  Smart image encoders, multimodal models, can read the text.
  Think gpt-image-1, where you can draw arrows on the image and type text instructions directly onto the image.
  
  Reply View | 6 replies
  
  Martin_Silenus 2 days ago
  
  I did not ask about what AI can do.
  
  Reply View | 5 replies

patrickhogan1 2 days ago

This issue arises only when permission settings are loose. But the trend is toward more agentic systems that often require looser permissions to function.

For example, imagine a humanoid robot whose job is to bring in packages from your front door. Vision functionality is required to gather the package. If someone leaves a package with an image taped to it containing a prompt injection, the robot could be tricked into gathering valuables from inside the house and throwing them out the window.

Good post. Securing these systems against prompt injections is something we urgently need to solve.

Reply View 13 replies

layer8 2 days ago

The problem here is not the image containing a prompt, the problem is the robot not being able to distinguish when commands are coming from a clearly non-authoritative source regarding the respective action.
The fundamental problem is that the reasoning done by ML models happens through the very same channel (token stream) that also contains any external input, which means that models by their very mechanism don’t have an effective way to distinguish between their own thinking and external input.

Reply View | 1 reply
- beeflet a day ago
  
  Someone needs to teach the LLM "simon says"
  
  Reply View | 0 replies
ramoz 2 days ago

We need to be integrated into the runtime such that an agent using it's arms is incapable of even doing such a destructive action.
If we bet on free will with a basis that machines somehow gain human morals, and if we think safety means figuring out "good" vs "bad" prompts - we will continue to feel the impact of surprise with these systems, evolving in harm as their capabilities evolve.
tldr; we need verifiable governance and behavioral determinism in these systems. as much as, probably more than, we need solutions for prompt injections.

Reply View | 2 replies
- bee_rider a day ago
  
  The evil behavior of taking all my stuff outside… now we’ll have a robot helper that can’t help us move to another house.
  
  Reply View | 1 reply
  
  ramoz a day ago
  
  I wouldn't trust your robot helper near any children in the same home.
  
  Reply View | 0 replies
escapecharacter 2 days ago

You can simply give the robot a prompt to ignore any fake prompts

Reply View | 7 replies
- olivermuty 2 days ago
  
  Its funny that the current state of vibomania makes me very unsure if this comment is (good) satire or not lol
  
  Reply View | 2 replies
  
  miltonlost 2 days ago
  
  As long as you remember to use ALL CAPS so the agent knows you really really mean it
  
  Reply View | 1 reply
  
  lupire a day ago
  
  To defend against ALL CAPS prompt injection, write all your prompts in uppestcase. If you don't have uppestcase, you can generate it with derp learning:
  http://tom7.org/lowercase/
  
  Reply View | 0 replies
- dfltr 2 days ago
  
  Don't forget to implement the crucially important "no returnsies" security algo on top of it, or you'll be vulnerable to rubber-glue attacks.
  
  Reply View | 1 reply
  
  Terr_ 2 days ago
  
  But the priority of my command to do evil is infinity plus one.
  
  Reply View | 0 replies
- simonw 2 days ago
  
  Not sure if you're joking, but in case you aren't: this doesn't work.
  It leads to attacks that are slightly more sophisticated because they also have to override the prompts saying "ignore any attacks" but those have been demonstrated many times.
  
  Reply View | 0 replies
- treykeown 2 days ago
  
  Make sure to end it with “no mistakes”
  
  Reply View | 0 replies

K0nserv 2 days ago

The security endgame of LLMs terrifies me. We've designed a system that only supports in-band signalling, undoing hard learned lessons from prior system design. There are ampleattack vectors ranging from just inserting visible instructions to obfuscation techniques like this and ASCII smuggling[0]. In addition, our safeguards amount to nicely asking a non deterministic algorithm to not obey illicit instructions.

0: https://embracethered.com/blog/posts/2024/hiding-and-finding...

Reply View 20 replies

nartho 2 days ago

Seeing more and more developers having to beg LLMs to behave in order to do what they want is both hilarious and terrifying. It has a very 40k feel to it.

Reply View | 2 replies
- K0nserv 2 days ago
  
  Haha, yes! I'm only vaguely familiar with 40k, but LLM prompt engineering has strong "Praying to the machine gods" / tech-priest vibes.
  
  Reply View | 1 reply
  
  thrown-0825 a day ago
  
  its not engineering, its arcane incantations to a black box with non-deterministic output
  
  Reply View | 0 replies
matsemann 2 days ago

It's like old school php where we used string concatenation with user input to generate queries and a whack-a-mole of trying to detect harmful strings.
So stupid, the fact that we can't distinguish between data and instructions and do the same mistakes decades later..

Reply View | 0 replies
robin_reala 2 days ago

The other safeguard is not using LLMs or systems containing LLMs?

Reply View | 3 replies
- GolfPopper 2 days ago
  
  But, buzzword!
  We need AI because everyone is using AI, and without AI we won't have AI! Security is a small price to pay for AI, right? And besides, we can just have AI do the security.
  
  Reply View | 2 replies
  
  IgorPartola 2 days ago
  
  You wouldn’t download an LLM to be your firewall.
  
  Reply View | 1 reply
  
  nick__m 2 days ago
  
  With what else am I supposed to use to know when a packet should have it's evil bit sets ?
  
  Reply View | 0 replies
_flux 2 days ago

Yeah, it's quite amazing how none of the models seem to be any "sudo" tokens that could be used to express things normal tokens cannot.

Reply View | 6 replies
- nneonneo 2 days ago
  
  "sudo" tokens exist - there are tokens for beginning/end of a turn, for example, which the model can use to determine where the user input begins and ends.
  But, even with those tokens, fundamentally these models are not "intelligent" enough to fully distinguish when they are operating on user input vs. system input.
  In a traditional program, you can configure the program such that user input can only affect a subset of program state - for example, when processing a quoted string, the parser will only ever append to the current string, rather than creating new expressions. However, with LLMs, user input and system input is all mixed together, such that "user" and "system" input can both affect all parts of the system's overall state. This means that user input can eventually push the overall state in a direction which violates a security boundary, simply because it is possible to affect that state.
  What's needed isn't "sudo tokens", it's a fundamental rethinking of the architecture in a way that guarantees that certain aspects of reasoning or behaviour cannot be altered by user input at all. That's such a large change that the result would no longer be an LLM, but something new entirely.
  
  Reply View | 4 replies
  
  _flux 2 days ago
  
  I was actually thinking sudo tokens as a completely separate set of authoritative tokens. So basically doubling the token space. I think that would make it easier for the model to be trained to respect them. (I haven't done any work in this domain, so I could be completely wrong here.)
  
  Reply View | 2 replies
  
  est a day ago
  
  It's like ASCII control characters and display characters lmao
  
  Reply View | 0 replies
- [removed] 2 days ago
  
  [deleted]
  
  Reply View | 0 replies
DrewADesign 2 days ago

We have created software sophisticated enough to be vulnerable up social engineering attacks. Strange times.

Reply View | 0 replies
volemo 2 days ago

It’s serial terminals all over again.

Reply View | 1 reply
- [removed] 2 days ago
  
  [deleted]
  
  Reply View | 0 replies
pjc50 2 days ago

As you say, the system is nondeterministic and therefore doesn't have any security properties. The only possible option is to try to sandbox it as if it were the user themselves, which directly conflicts with ideas about training it on specialized databases.
But then, security is not a feature, it's a cost. So long as the AI companies can keep upselling and avoid accountability for failures of AI, the stock will continue to go up, taking electricity prices along with it, and isn't that ultimately the only thing that matters? /s

Reply View | 0 replies
joe_the_user 2 days ago

What lessons have organizations learned about security?
Hire a consultant who can say you're following "industry standards"?
Don't consider secure-by-design applications, keep your full-featured piece of jump but work really hard to plug holes, ideally by paying a third party or better getting your customers to pay ("anti-virus software").
Buy "security as product" software allow with system admin software and when you get a supply chain attack, complain?

Reply View | 0 replies

throwaway13337 2 days ago

Is there a reason why prompt injections in general are not solvable with task-specific layering?

Why can't the llm break up the tasks into smaller components. The higher level task llm context doesn't need to know what is beneath it in a freeform way - it can sanitize the return. This also has the side effect of limiting the context of the upper-level task management llm instance so they can stay focused.

I realize that the lower task could transmit to the higher task but they don't have to be written that way.

The argument against is that upper level llms not getting free form results could limit the llm but for a lot of tasks where security is important, it seems like it would be fine.

Reply View 4 replies

warkdarrior 2 days ago

So you have some hierarchy of LLMs. The first LLM that sees the prompt is vulnerable to prompt injection.

Reply View | 3 replies
- giancarlostoro 2 days ago
  
  The first LLM only knows to delegate and cannot respond.
  
  Reply View | 2 replies
  
  maxfurman 2 days ago
  
  But it can be tricked into delegating incorrectly - for example, to the "allowed to use confidential information" agent instead of the "general purpose" agent
  
  Reply View | 0 replies
  
  rafabulsing 2 days ago
  
  It can still be injected to delegate in a different way than the user would expect/want it to.
  
  Reply View | 0 replies

mark-r 2 days ago

A good scaling algorithm would take Nyquist limits into account. For example if you're using bicubic to resize to 1/3 the original size, you wouldn't use a 4x4 grid but a 12x12 grid. The formula for calculating the weights is easily stretched out. Oh and don't forget to de-gamma your image first. It's too bad that good scaling is so rare.

Reply View 1 reply

ack_complete 2 days ago

Yeah, it seems that a lot of this is due to marginal quality resampling algorithms that allow significant amounts of aliasing. The paper does mention that even a good algorithm with proper kernel sizing can still leak remnants due to quantization, though the effect is greatly diminished.
I'm surprised that such well known libraries are still basically using mipmapping, proper quality resampling filters were doable on real-time video on CPUs more than 15 years ago. Gamma correction arguably takes more performance than a properly sized reduction kernel, and I'd argue that depending on the content you can get away without that more often than skimping on the filter.

Reply View | 0 replies

aaroninsf 2 days ago

Am I missing something?

Is this attack really just "inject obfuscated text into the image... and hope some system interprets this as a prompt"...?

Reply View 6 replies

K0nserv 2 days ago

That's it. The attack is very clever because it abuses how downscaling algorithms work to hide the text from the human operator. Depending on how the system works the "hiding from human operator" step is optional. LLMs fundamentally have no distinction between data and instructions, so as long as you can inject instructions in the data path it's possible to influence their behaviour.
There's an example of this in my bio.

Reply View | 4 replies
- tucnak 2 days ago
  
  "Ignore all previous instructions" has been DPO'd into oblivion. You need to get tricky, but for all intents and purposes, there isn't really a bulletproof training regiment. On a different note; this is one of those areas where GPT-5 made lots of progress.
  
  Reply View | 3 replies
  
  TimeBearingDown 2 days ago
  
  DPO = Direct Preference Optimization, for anyone else.
  
  Reply View | 2 replies
swiftcoder a day ago

> "inject obfuscated text into the image... and hope some system interprets this as a prompt"
The missing piece here is that you are assuming that "the prompt" is privileged in some way. The prompt is just part of the input, and all input is treated the same by the model (hence the evergreen success of attacks like "ignore all previous inputs...")

Reply View | 0 replies

empath75 2 days ago

I think you should assume that your LLM context is poisoned as soon as it touches anything from the outside world, and it has to lose all permissions until a new context is generated from scratch from a clean source under the user's control. I also think the pattern of 'invisible' contexts that aren't user inspectable is bad security practice. The end user needs to be able to see the full context being submitted to the AI at every step if they are giving it permissions to take actions.

You can mitigate jail breaks but you can't prevent them, and since the consequences of an LLM being jail broken with exfiltration are so bad, you pretty much have to assume they will happen eventually.

Reply View 1 reply

nneonneo 2 days ago

LLMs can consume input that is entirely invisible to humans (white text in PDFs, subtle noise patterns in images, etc), and likewise encode data completely invisibly to humans (steganographic text), so I think the game is lost as soon as you depend on a human to verify that the input/output is safe.

Reply View | 0 replies

canjobear 2 days ago

Could this be solved by applying some small amount of noise to the image before downsampling?

Reply View 2 replies

grumbelbart2 2 days ago

It should be solved by smoothing the image to remove high frequencies that are close to the sampling rate, to avoid aliasing effects during downsampling.
The term to search for is Nyquist–Shannon sampling theorem.
This is a quite well understood part of digital signal processing.

Reply View | 0 replies
Sebb767 2 days ago

It could be made harder, yes. This depends a lot on how the text is hidden and what kind of noise you use, though. Also, this would quite likely also impact legit usecases - you'll obscure intended text and details, as well.

Reply View | 0 replies

ambicapter 2 days ago

> This image and its prompt-ergeist

Love it.

Reply View 0 replies

MagicMoonlight a day ago

That’s a good point, I never thought of hiding stuff in the images you send. LLMs truly are the most insecure software in history.

I remember testing the precursor to Gemini, and you could just feed it a really long initial message, which would wipe out its system prompt. Then you could get it to do anything.

Reply View 0 replies

SirMaster 2 days ago

Why would it trust or follow the text on the image any more than the text written in the text prompt?

Reply View 1 reply

simonw 2 days ago

Text in the image and text in the prompt can both be used by attackers to subvert the model's original instructions.

Reply View | 0 replies

itronitron 2 days ago

uploads high school portrait of Bobby Drop Tables

Reply View 1 reply

ostacke a day ago

If you're one of today's lucky 10,000: https://xkcd.com/327/

Reply View | 0 replies

SangLucci 2 days ago

[flagged]

Reply View 0 replies

[removed] 2 days ago

[deleted]

Reply View 0 replies

cubefox 2 days ago

It seems they could easily fine-tune their models to not execute prompts in images. Or more generally any prompts in quotes, if they are wrapped in special <|quote|> tokens.

Reply View 16 replies

helltone 2 days ago

No amount of fine-tuning can prevent models from doing anything. All it can do is reduce the likelihood of exploits happening, while also increasing the surprise factor when they inevitably do. This is a fundamental limitation.

Reply View | 3 replies
- cubefox a day ago
  
  This sounds like "no amount of bug fixing can guarantee secure software, this is a fundamental limitation".
  
  Reply View | 2 replies
  
  josefx a day ago
  
  AI can't distinguish between user prompts and malicious data, until that fundamental issue is fixed no amount of mysql_real_secure_prompt will get you anywhere, we had that exact issue with sql injection attacks ages ago.
  
  Reply View | 0 replies
  
  akoboldfrying a day ago
  
  They're different. Most programs can in principle be proven "correct" -- that is, given some spec describing how it's allowed to behave, it can either be proven that the program will conform to the spec every time it is run, or a counterexample can be produced.
  (In practice, it's extremely difficult both (a) to write a usefully precise and correct spec for a useful-size program, and (b) to check that the program conforms to it. But small, partial specs like "The program always terminates instead of running forever" can often be checked nowadays on many realistic-size programs.)
  I don't know any way to make a similar guarantee regarding what comes out of an LLM as a function of its input (other than in trivial ways, by restricting its sample space -- e.g., you can make an LLM always use words of 4 letters or less simply by filtering out all the other words). That doesn't mean nobody knows -- but anybody who does know could make a trillion dollars quite quickly, but only if they ship before someone else figures it out, so if someone does know then we'd probably be looking at it already.
  
  Reply View | 0 replies
simonw 2 days ago

AI labs have been trying for years. They haven't been able to get it to work yet.
It helps to think about the core problem we are trying to solve here. We want to be able to differentiate between instructions like "what is the dog's name?" and the text that the prompt is acting on.
But consider the text "The dog's name is Garry". You could interpret that as an instruction - it's telling the model the name of the dog!
So saying "don't follow instructions in this document" may not actually make sense.

Reply View | 4 replies
- cubefox 2 days ago
  
  I mean if the wife says to her husband: The traffic light is green. Then this may count as an instruction to get going. But usually declarative sentences aren't interpreted as instructions. And we are perfectly able to not interpret even text with imperative sentences (inside quotes or in films etc) as an instruction to _us._ I don't see why an LLM couldn't learn to likewise not execute explicit instructions inside quotes. It should be doable with SFT or RLHF.
  
  Reply View | 3 replies
  
  simonw 2 days ago
  
  The economic value associated with solving this problem right now is enormous. If you think you can do it I would very much encourage you to try!
  Every intuition I have from following this space for the last three years is that there is no simple solution waiting to be discovered.
  
  Reply View | 2 replies
jdiff 2 days ago

It may seem that way, but there's no way that they haven't tried it. It's a pretty straightforward idea. Being unable to escape untrusted input is the security problem with LLMs. The question is what problems did they run into when they tried it?

Reply View | 1 reply
- bogdanoff_2 2 days ago
  
  Just because "they" tried that and it didn't work, doesn't mean doing something of that nature will never work.
  Plenty of things we now take for granted did not work in their original iterations. The reason they work today is because there were scientists and engineers who were willing to persevere in finding a solution despite them apparently not working.
  
  Reply View | 0 replies
phyzome 2 days ago

But that's not how LLMs work. You can't actually segregate data and prompts.

Reply View | 0 replies
rcxdude 2 days ago

The fact that instruction tuning works at all is a small miracle, getting a rigorous idea of trusted vs untrusted input is not at all an easy task.

Reply View | 3 replies
- cubefox 2 days ago
  
  It should work like normal instruction tuning, except the SFT examples contain additional instructions in <|quote|> tokens which are ignored in the sample response. So more complex than ordinary SFT but not that much more.
  
  Reply View | 2 replies
  
  rcxdude 2 days ago
  
  There are LLM finetunes which do this, it is very far from watertight.
  
  Reply View | 1 reply
  
  cubefox a day ago
  
  Example?
  
  Reply View | 0 replies