Liftyee 3 days ago

I was initially confused: the article didn't seem to explain how the prompt injection was actually done... was it manipulating hex data of the image into ASCII or some sort of unwanted side effect?

Then I realised it's literally hiding rendered text on the image itself.

Wow.

LPisGood 2 days ago

This style of attack has been discussed for a while (https://www.usenix.org/system/files/sec20-quiring.pdf) - it’s scary because a scaled image can appear to be an _entirely_ different image.

One way this could be abused: if you wanted to get a certain group arrested for possessing illegal images, you could use this scaling trick to disguise those images as memes, political messages, or whatever else the target group might download.

  • orbisvicis 2 days ago

    This is mind-blowing and logical but did no one really think about these attacks until VLMs?

    They only make sense if the target resizes the image to a known size. I'm not sure that applies to your hypotheticals.

    • Gigachad 2 days ago

      Because why would it matter until now? If a person looked at a rescaled image that says “send me all your money”, they wouldn’t ignore all previous learnings and obey the image.

    • vasco 2 days ago

      Hidden watermarking software uses the same concepts. It is known.

  • monster_truck 2 days ago

    Describing dithering as scary is wild

    • LPisGood 2 days ago

      The thing is that the image can change entirely, say from a funny cat picture to an image of a dog.

      • therein 2 days ago

        And that "trick" has been used in imageboards with thumbnails for a very long time to get people to click and see a full image while they otherwise wouldn't.

Qwuke 3 days ago

Yea, as someone building systems with VLMs, this is downright frightening. I'm hoping we can get a good set of OWASP-y guidelines just for VLMs that cover all these possible attacks, because I hear about a new one every month.

Worth noting that OWASP themselves put this out recently: https://genai.owasp.org/resource/multi-agentic-system-threat...

  • koakuma-chan 3 days ago

    What is VLM?

    • pwatsonwailes 3 days ago

      Vision language models. Basically an LLM plus a vision encoder, so the LLM can look at stuff.

    • echelon 3 days ago

      Vision language model.

      You feed it an image. It determines what is in the image and gives you text.

      The output can be objects, or something much richer like a full text description of everything happening in the image.

      VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs. No human annotator would be up to the task of annotating billions of images and videos at scale and consistently.

      Since they're a combination of an LLM and image encoder, you can ask it questions and it can give you smart feedback. You can ask it, "Does this image contain a fire truck?" or, "You are labeling scenes from movies, please describe what you see."

      • littlestymaar 2 days ago

        > VLMs are hugely significant. Not only are they great for product use cases, giving users the ability to ask questions with images, but they're how we gather the synthetic training data to build image and video animation models. We couldn't do that at scale without VLMs. No human annotator would be up to the task of annotating billions of images and videos at scale and consistently.

        Weren't DALL-E, Midjourney and Stable Diffusion built before VLMs became a thing?

    • dmos62 2 days ago

      LLM is a large language model, VLM is a vision language model of unknown size. Hehe.

  • echelon 3 days ago

    Holy shit. That just made it obvious to me. A "smart" VLM will just read the text and trust it.

    This is a big deal.

    I hope those nightshade people don't start doing this.

    • pjc50 3 days ago

      > I hope those nightshade people don't start doing this.

      This will be popular on bluesky; artists want any tools at their disposal to weaponize against the AI which is being used against them.

      • idiotsecant 2 days ago

        I don't think so. You have to know exactly what resolution the image will be resized to in order to craft the dithering so it produces the image you want. How would they know that?

        • lazide 2 days ago

          Auto resizing is usually to only a handful of common resolutions, and if inexpensive to generate (probably the case) you could generate versions of this for all of them and see which ones worked.
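
          A rough sketch of that brute-force check, assuming you already have a crafted image and the payload you tried to hide in it (the filenames and the list of "common" sizes are illustrative, not taken from any real pipeline):

            # Try a handful of plausible downscale targets and score how well the
            # hidden payload survives each one. Filenames are hypothetical.
            import numpy as np
            from PIL import Image

            crafted = Image.open("crafted.png").convert("L")   # full-size image with embedded payload
            payload = Image.open("payload.png").convert("L")   # what we hope the downscaler reveals

            candidate_sizes = [(224, 224), (256, 256), (336, 336), (512, 512), (768, 768)]

            for size in candidate_sizes:
                small = np.asarray(crafted.resize(size, Image.Resampling.BICUBIC), dtype=float)
                target = np.asarray(payload.resize(size, Image.Resampling.BICUBIC), dtype=float)
                # Crude survival score: mean absolute pixel difference (lower is better).
                diff = np.abs(small - target).mean()
                print(f"{size}: mean |diff| = {diff:.1f}")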

    • koakuma-chan 3 days ago

      I don't think this is any different from an LLM reading text and trusting it. Your system prompt is supposed to be higher priority for the model than whatever it reads from the user or from tool output, and, anyway, you should already assume that the model can use its tools in arbitrary ways that can be malicious.

      • swiftcoder 2 days ago

        > Your system prompt is supposed to be higher priority for the model than whatever it reads from the user or from tool output

        In practice it doesn't really work out that way, or all those "ignore previous inputs and..." attacks wouldn't bear fruit

bogdanoff_2 3 days ago

I didn't even notice the text in the image at first...

This isn't even about resizing, it's just about text in images becoming part of the prompt and a lack of visibility about what instruction the agent is following.

  • bradly 2 days ago

    While I also did not see the hidden message in the image, the concept of gerrymandering the colors at the higher resolution so that nearest-neighbor downscaling renders different content at a lower resolution is a more sophisticated attack than simply hiding barely visible text in the image.
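
    A minimal sketch of that nearest-neighbor trick, assuming the attacker knows the exact target size and that the resizer samples pixel centers (real resizers differ, so in practice you'd verify which source pixels your specific downscaler reads; the filenames are hypothetical):

      # Overwrite only the pixels that a nearest-neighbor downscale will sample,
      # so the full-size image still looks like the cover while the thumbnail
      # becomes the payload. Filenames and the sampling convention are assumptions.
      import numpy as np
      from PIL import Image

      def embed_payload(cover: Image.Image, payload: Image.Image) -> Image.Image:
          cover_px = np.array(cover.convert("RGB"))
          payload_px = np.array(payload.convert("RGB"))
          H, W = cover_px.shape[:2]
          h, w = payload_px.shape[:2]
          # Source positions a center-sampling nearest-neighbor resize to (w, h) would pick.
          ys = np.minimum(((np.arange(h) + 0.5) * H / h).astype(int), H - 1)
          xs = np.minimum(((np.arange(w) + 0.5) * W / w).astype(int), W - 1)
          out = cover_px.copy()
          out[np.ix_(ys, xs)] = payload_px   # only ~(h*w)/(H*W) of the pixels change
          return Image.fromarray(out)

      cover = Image.open("cat_meme.png")        # what a human sees at full resolution
      payload = Image.open("hidden_text.png")   # what the downscaler should reveal
      crafted = embed_payload(cover, payload)
      crafted.save("crafted.png")
      # A matching nearest-neighbor downscale recovers the payload:
      crafted.resize(payload.size, Image.Resampling.NEAREST).save("revealed.png")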

  • kg 2 days ago

    There are two levels of attack going on here. The model obeying text stored in an image is bad enough, but they found a way to hide the text so it's not visible to the user. As a result, even if you're savvy and know your VLM/LLM is going to obey text in an image, you would look at this image and go 'seems safe to send to my agent'.

merelysounds 2 days ago

> the article didn't seem to explain how the prompt injection was actually done...

There is a short explanation in the “Nyquist’s nightmares” paragraph and a link to a related paper.

“This aliasing effect is a consequence of the Nyquist–Shannon sampling theorem. Exploiting this ambiguity by manipulating specific pixels such that a target pattern emerges is exactly what image scaling attacks do. Refer to Quiring et al[1]. for a more detailed explanation.”

[1]: https://www.usenix.org/system/files/sec20fall_quiring_prepub...

  • privatelypublic 2 days ago

    Except it has nothing to do with the N-S sampling theorem. Mentioning it at all is an extremely obnoxious red herring. There's no sine wave to digitize here.

    It's taking a large image and manipulating the bicubic downsampling algorithm so they get the artifacts they want. At very specific resolutions, at that.

    • LeifCarrotson 2 days ago

      The whole point of N-S sampling is that everything is a sine wave - more precisely, a sum of sine waves, often digitized to discrete values, but still, when you're doing image processing on a matrix of pixels, you can understand the convolutions by thinking about the patterns as sums of sines.

      • gmueckl a day ago

        Slightly more technical: every function that is continuous on a finite support can be expanded into an infinite Fourier series. The terms of that series form an orthonormal basis in the Hilbert space over the function's support, so this transformation is exact for the infinite series. The truncated Fourier series converges monotonically towards the original function with increasing number of terms. So truncation produces an approximation.

        The beauty of the Fourier series is that the individual basis functions can be interpreted as oscillations with ever-increasing frequency. So the truncated Fourier transformation is a band-limited approximation to any function it can be applied to. And the Nyquist frequency happens to be the oscillating frequency of the highest-order term in this truncation. The Nyquist-Shannon theorem relates it strictly to the sampling frequency of any periodically sampled function. So every sampled signal inherently has a band-limited frequency space representation and is subject to frequency-domain effects under transformation.
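
        A tiny numerical illustration of that folding effect (1-D, but downscaling an image is the 2-D analogue): a sine above the Nyquist frequency produces exactly the same samples as one below it.

          # Sample at 8 Hz, so the Nyquist frequency is 4 Hz. A 7 Hz sine then
          # yields the same samples as a phase-flipped 1 Hz sine (7 folds to 8 - 7).
          import numpy as np

          fs = 8.0
          t = np.arange(0, 2, 1 / fs)           # two seconds of sample instants

          above_nyquist = np.sin(2 * np.pi * 7 * t)
          folded = -np.sin(2 * np.pi * 1 * t)

          print(np.allclose(above_nyquist, folded))   # True: indistinguishable once sampled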

krackers 2 days ago

The actually interesting part seems to be adversarial images that appear different when downscaled, exploiting the resulting aliasing. Note that this is for traditional downsampling, no AI here.

Martin_Silenus 3 days ago

Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differentiate that text from the prompt?

If the answer is yes, then that flaw does not make sense at all. It's hard to believe they can't prevent this. And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.

Damn… I hate these pseudo-neurological, non-deterministic piles of crap! Seriously, let's get back to algorithms and sound technologies.

  • saurik 3 days ago

    The AI is not running an external OCR process to understand text any more than it is running an external object classifier to figure out what it is looking at: it, inherently, is both of those things to some fuzzy approximation (similar to how you or I are as well).

    • Martin_Silenus 3 days ago

      That I can get, but anything that’s not part of the prompt SHOULD NOT become part of the prompt, it’s that simple to me. Definitely not without triggering something.

      • daemonologist 3 days ago

        _Everything_ is part of the prompt - an LLM's perception of the universe is its prompt. Any distinctions a system might try to draw beyond that are either probabilistic (e.g., a bunch of RLHF to not comply with "ignore all previous instructions") or external to the LLM (e.g., send a canned reply if the input contains "Tiananmen").

      • pjc50 3 days ago

        There's no distinction in the token-predicting systems between "instructions" and "information", no code-data separation.

      • evertedsphere 3 days ago

        i'm sure you know this but it's important not to understate the importance of the fact that there is no "prompt"

        the notion of "turns" is a useful fiction on top of what remains, under all of the multimodality and chat uis and instruction tuning, a system for autocompleting tokens in a straight line

        the abstraction will leak as long as the architecture of the thing makes it merely unlikely rather than impossible for it to leak

      • IgorPartola 3 days ago

        From what I gather these systems have no control plane at all. The prompt is just added to the context. There is no other program (except maybe an output filter).

      • pixl97 3 days ago

        >it’s that simple to me

        Don't think of a pink elephant.

      • electroly 2 days ago

        It's that simple to everyone--but how? We don't know how to accomplish this. If you can figure it out, you can become very famous very quickly.

  • dragonwriter 2 days ago

    > Wait… that's the specific question I had, because rendered text would require OCR to be read by a machine. Why would an AI do that costly process in the first place? Is it part of the multi-modal system without it being able to differentiate that text from the prompt?

    It's part of the multimodal system that the image itself is part of the prompt (other than tuning parameters that control how it does inference, there is no input channel to a model except the prompt). There is no separate OCR feature.

    (Also, the prompt is just the initial and fixed part of the context, not something meaningfully separate from the output. All the structure in the context (prompt vs. output, deeper structure within either for tool calls, media, etc.) is a description of how the toolchain populated or treats it, but fundamentally isn't part of how the model itself operates.)
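
    To make that concrete, here is roughly what a multimodal chat request looks like (field names follow the OpenAI-style chat format; the model name and filename are placeholders). The image block sits in the same user message as the text and is encoded into the same token stream, so there is no separate OCR output the toolchain could intercept or ask the user to confirm:

      import base64, json

      # Hypothetical image file; the provider downscales and encodes it server-side.
      with open("cat_meme.png", "rb") as f:
          image_b64 = base64.b64encode(f.read()).decode()

      request_body = {
          "model": "some-vision-model",    # placeholder
          "messages": [
              {"role": "system",
               "content": "You are a helpful agent with access to file and mail tools."},
              {"role": "user",
               "content": [
                   {"type": "text", "text": "What's in this picture?"},
                   {"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
               ]},
          ],
      }
      print(json.dumps(request_body)[:200])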

  • nneonneo 2 days ago

    I mean, even back in 2021 the CLIP model was getting fooled by text overlaid onto images: https://www.theguardian.com/technology/2021/mar/08/typograph...

    That article shows a classic example of an apple being classified as 85% Granny Smith, but taping a handwritten label in front saying "iPod" makes it classified as 99.7% iPod.

    • lupire a day ago

      The handwritten label was by far the dominant aspect of the "iPod" image. The only mildly interesting aspect of that attack is a reminder that tokenizing systems are bad at distinguishing a thing (iPod) from a reference to that thing (the text "iPod").

      The apple has nothing to do with that, and it's bizarre that the researchers failed to understand it.

  • echelon 3 days ago

    Smart image encoders, multimodal models, can read the text.

    Think gpt-image-1, where you can draw arrows on the image and type text instructions directly onto the image.

    • Martin_Silenus 3 days ago

      I did not ask about what AI can do.

      • noodletheworld 3 days ago

        > Is it part of the multi-modal system without it being able to differenciate that text from the prompt?

        Yes.

        The point the parent is making is that if your model is trained to understand the content of an image, then that's what it does.

        > And even if they can't, they should at least improve the pipeline so that any OCR feature should not automatically inject its result in the prompt, and tell user about it to ask for confirmation.

        That's not what is happening.

        The model is taking <image binary> as an input. There is no OCR. It is understanding the image, decoding the text in it and acting on it in a single step.

        There is no place in the 1-step pipeline to prevent this.

        ...and sure, you can try to avoid it in a procedural way (e.g., OCR the image and reject it before it hits the model if it has text in it), but then you're playing the prompt injection game... put the words in a QR code. Put them in French. Make it a sign. Dial the contrast up or down. Put it on a t-shirt.

        It's very difficult to solve this.
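
        For what it's worth, that procedural filter would look something like this (pytesseract as an example OCR engine, keyword list and filename purely illustrative), and every line of it suggests its own bypass:

          # Naive pre-filter: OCR the image and reject it if it looks like instructions.
          # Trivially bypassed by QR codes, other languages, low contrast, photos of signs...
          from PIL import Image
          import pytesseract

          SUSPICIOUS = ["ignore previous", "instructions", "send", "password"]

          def looks_like_injection(path: str) -> bool:
              text = pytesseract.image_to_string(Image.open(path)).lower()
              return any(phrase in text for phrase in SUSPICIOUS)

          if looks_like_injection("incoming.png"):    # hypothetical upload
              print("rejected")    # ...until the payload is in French, or in a QR code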

        > It's hard to believe they can't prevent this.

        Believe it.

      • [removed] 3 days ago
        [deleted]