Comment by jdietrich
It's reasonably well established that large neural networks don't contain copies of the training data, therefore their outputs can't be considered copies of anything. The model might contain a conceptual representation of Harrison Ford's face, but that's very different to a verbatim representation of a particular copyrighted image of Harrison Ford. Model weights aren't copyrightable; it's plausible that model outputs aren't copyrightable either, but there are some fairly complicated arguments around authorship. Training an AI model on copyrighted work is highly likely to be fair use under US law, but plausibly isn't fair dealing under British law or a permitted use under Article 5 of the EU Information Society Directive (2001/29/EC).
All of that is entirely separate from trademark law, which would prevent you from using any representation of a trademarked character unless e.g. you can reasonably argue that you are engaged in parody.
From the standpoint of using a human likeness, I don't see the difference between encoding a "conceptual representation" of Ford's face into a model and encoding it into any other digital or analog format from which it can later be decoded into a reasonable facsimile of the original.
I think that calling it a "conceptual representation" over-complicates the issue. At the very least, the model weights encode a process that can reproduce copies of their training data. A 300x300 pixel image of Harrison Ford's face is one of an astronomically large space of possible images — on the order of 10^650,000 at 24-bit colour depth. Obviously, only a tiny fraction of all possible images are encoded in the model. Is encoding those particular weights into a diffusion model which can select that face by a process of refinement really much different than, say, encoding the image into a set of fractal algorithms, or a set of vectors?
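The size of that space is easy to check with a back-of-the-envelope calculation (assuming a standard 24-bit RGB image; the exact figure isn't the point, only its scale):

```python
import math

# Count the possible 300x300 images at 24-bit colour:
# each pixel takes one of 256 values in each of 3 channels.
width, height, channels, depth = 300, 300, 3, 256

pixels = width * height                      # 90,000 pixels
total_images = depth ** (channels * pixels)  # 256^270,000 = 2^2,160,000

# Far too large to print in full; report the number of decimal digits.
digits = channels * pixels * math.log10(depth)
print(f"~10^{digits:,.0f} possible images")  # prints "~10^650,225 possible images"
```

Whatever assumptions you make about resolution or bit depth, the space of possible images dwarfs anything the model could have encoded.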
I'd argue that the largest models are akin to a compression method that has simply pre-encoded every word and image they've ingested, such that the "compressed file" is the prompt you give to the AI. Even with billions of weights trained on millions of texts and images, they've only encoded a vanishingly small fraction of the entire space. Semantically you could call it something other than a "copy", but functionally how is it any different?
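The asymmetry that analogy rests on can be made concrete: a short prompt carries only a few hundred bits, far too few to specify an image on its own, so the missing information must already be stored in the weights. A rough sketch, with illustrative numbers (the prompt length and encoding are assumptions, not measurements):

```python
import math

# Information budget of a text prompt vs. the raw image it "decompresses" to.
prompt_chars = 60                  # a typical short prompt (assumed)
bits_per_char = math.log2(95)      # 95 printable ASCII characters, ~6.57 bits each
prompt_bits = prompt_chars * bits_per_char   # ~394 bits

image_bits = 300 * 300 * 24        # raw 300x300 24-bit RGB image: 2,160,000 bits

# The prompt is thousands of times smaller than the image it selects.
print(f"prompt: ~{prompt_bits:.0f} bits, image: {image_bits:,} bits")
```

On these numbers the "compressed file" is roughly 5,000 times smaller than its output, which is the sense in which the weights themselves must be doing the encoding.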