Comment by jsheard
It is known that the LAION dataset underpinning foundation models like Stable Diffusion contained at least a few thousand instances of real-life CSAM at one point. I think you would be hard-pressed to prove that any model trained on internet scrapes definitively wasn't trained on any CSAM whatsoever.
https://www.theverge.com/2023/12/20/24009418/generative-ai-i...
> I think you would be hard-pressed to prove that any model trained on internet scrapes definitively wasn't trained on any CSAM whatsoever.
I'd be hard-pressed to prove that you definitely hadn't killed anybody ever.
Legally, if it's asserted that these images are criminal because they are the product of a model trained on sources that contained CSAM, then the burden would be on the asserting party to prove that claim.
With text and speech you could prompt the model to reproduce a Sarah Silverman monologue verbatim and assert that this proves her content was in the training set, etc.
Here the defense would ask the prosecution to demonstrate how to extract a copy of original CSAM.
But your point is well taken: it's likely that most image generation programs of this nature have been fed at least one image that was borderline, and probably at least one that was well below the line.