Comment by noduerme 9 hours ago

From the standpoint of using a human likeness, I don't see the difference between encoding a "conceptual representation" of Ford's face into a model and encoding it into any other digital or analog format from which it can later be decoded into a reasonable facsimile of the original.

I think that calling it a "conceptual representation" over-complicates the issue. At the very least, the model weights encode a process that can produce a copy of their training data. A 300x300 pixel image of Harrison Ford's face is one of roughly 2^2,160,000 possible 24-bit images (90,000 pixels at 24 bits each). Obviously, only a tiny fraction of all possible images are encoded in the model. Is encoding those particular weights into a diffuser which can select that face by a process of refinement really much different than, say, encoding the image into a set of fractal algorithms, or a set of vectors?
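To put a rough number on the size of that image space (a back-of-envelope sketch of the counting argument above, nothing to do with any actual model):

```python
import math

# Space of 300x300 RGB images at 8 bits per channel.
pixels = 300 * 300                    # 90,000 pixels
bits_per_pixel = 24                   # 3 channels x 8 bits
total_bits = pixels * bits_per_pixel  # 2,160,000 bits

# Number of distinct images: 2^2,160,000, or about 10^650,000.
log10_images = total_bits * math.log10(2)
print(f"2^{total_bits} distinct images (~10^{log10_images:.0f})")
```

Any scheme that picks out one specific face from that space, whether diffusion weights, fractal coefficients, or vectors, is doing the same basic job of addressing a point in it.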

I'd argue that the largest models are akin to a compression method that has simply pre-encoded every word and image they've ingested, such that the "compressed file" is the prompt you give to the AI. Even with billions of weights trained on millions of texts and images, they've only encoded a vanishingly small fraction of the entire space. Semantically you could call it something other than a "copy", but functionally how is it any different?
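The prompt-as-compressed-file idea can be illustrated with a toy decoder (a seeded PRNG standing in for the model; the prompt strings and sizes here are made up for illustration):

```python
import random

def decoder(prompt: str, n_bytes: int = 90_000) -> bytes:
    """A fixed, deterministic 'model': the same prompt always yields
    the same large output. Toy stand-in for the argument, not a real
    diffusion model."""
    rng = random.Random(prompt)  # the prompt seeds the generator
    return bytes(rng.randrange(256) for _ in range(n_bytes))

# A ~20-byte prompt deterministically selects one specific 90,000-byte
# output: the information lives in (decoder + prompt), just as a
# compressed file only means something alongside its decompressor.
img_a = decoder("harrison ford, 300x300")
img_b = decoder("harrison ford, 300x300")
assert img_a == img_b  # fully reproducible from the prompt alone
```

On this view the weights play the role of the decompressor, and the prompt is the tiny file that addresses one output among the astronomically many possible ones.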