Comment by Retric
I think you may have something with that line of reasoning.
Unfortunately, the threshold for a fictional work to count as transformative is fairly high. Fan fiction and reasonably distinct works with excessive inspiration are both copyright infringing. https://en.wikipedia.org/wiki/Tanya_Grotter
> Models themselves are very clearly transformative.
A near word-for-word copy of large sections of a work seems nowhere near that threshold. An MP3 isn’t even close to a 1:1 copy of a piece of music, but the inherent differences are irrelevant. A neural network containing and allowing the extraction of information looks a lot like lossy compression.
Models could easily be transformative, but the justification needs to go beyond "well, obviously they are."
Models are not word-for-word copies of large sections of text. They are capable of emitting that text, though.
It would be interesting to look at what legal precedents were set regarding MP3s or other encodings. Is the encoding itself an infringement, or is it the decoding, or is it the distribution of a decodable form of a work?
There is also a distinction to draw with a lossy encoding that encodes a single work. There is clarity when the encoded form serves no purpose other than to be decoded into a given work. When the encoding acts as a bulk archive, does the responsibility shift to those who choose what to extract from the archive?