Comment by tpmoney
At what point do you cross the line from "legitimate use of a work" to illegitimate use?
If I take my legally purchased epub of a book, pipe it through `wc`, and release the output, is that a violation of copyright? What about 10 books? 100? How many books would I have to pipe through `wc` before the output becomes a violation of copyright?
What if I take those same books and generate a spreadsheet of all the words and how frequently they're used? Again, same question, where is the line between "fine" and "copyright violation"?
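For concreteness, here is a minimal sketch of that word-frequency step in JavaScript. The function name `wordFrequencies`, the sample sentence, and the tokenizing rule (lowercased runs of ASCII letters and apostrophes) are assumptions made for illustration, not anything specified above.

```javascript
// Hypothetical helper: count how often each word appears in a text.
function wordFrequencies(text) {
  const counts = new Map();
  // Assumed tokenizer: lowercase, then take runs of letters/apostrophes as words.
  const words = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Stand-in for the text of a book.
const sampleText = "the cat sat on the mat and the dog sat too";
const freq = wordFrequencies(sampleText);

// Print one "spreadsheet" row per word, most frequent first.
const rows = [...freq.entries()].sort(
  (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
);
for (const [word, n] of rows) {
  console.log(`${word}\t${n}`);
}
```

The output keeps only counts: word order, sentences, and everything else about the original text are discarded.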
What if I take that spreadsheet, load it into a website, and write a JavaScript program that weights every word by its count and then generates random text strings based on those weights? Is that not essentially an LLM in all but usefulness? Is that a violation of copyright now that I'm generating new content based on statistical information about copyrighted content? If I let such a program run long enough and on enough machines, I'm sure it would generate strings of text from the works that went into the model. Is that what makes this a copyright violation?
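A rough sketch of such a generator, assuming the counts are already available as a plain object. The names `makeWeightedSampler` and `generateText`, the example counts, and the injectable `rng` parameter are all hypothetical, added only so the sketch is self-contained.

```javascript
// Hypothetical sketch: pick words at random, weighted by how often they occurred.
// `counts` maps word -> count; `rng` returns a number in [0, 1).
function makeWeightedSampler(counts, rng = Math.random) {
  const words = Object.keys(counts);
  const total = words.reduce((sum, w) => sum + counts[w], 0);
  return function sample() {
    // Choose a point in [0, total) and walk the words until it is used up.
    let r = rng() * total;
    for (const w of words) {
      r -= counts[w];
      if (r < 0) return w;
    }
    return words[words.length - 1]; // guard against floating-point edge cases
  };
}

// Join `length` independently drawn words into one string.
function generateText(counts, length, rng = Math.random) {
  const sample = makeWeightedSampler(counts, rng);
  const out = [];
  for (let i = 0; i < length; i++) out.push(sample());
  return out.join(" ");
}

// Made-up counts standing in for the spreadsheet.
const exampleCounts = { the: 3, sat: 2, cat: 1, mat: 1 };
console.log(generateText(exampleCounts, 8));
```

Each word here is drawn independently of the ones before it, so the only information carried over from the source texts is how often each word appeared.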
If that's not a violation, how many other statistical transformations and weighting models would I have to add to my JavaScript program before it becomes a violation of copyright? I don't think it's reasonable to say any part of this is "clearly not" fair use, no matter how many books I pump into that original set of statistics. And at least so far, the US courts agree with that.
I think your analogy is a massive stretch. `wc` is neither generative nor capable of having market effect.
Your second construction is generative, but likely worse than a Markov chain model, which also did not have any market effect.
We're talking about the models that have convinced every VC they can make a trillion dollars from replacing millions of creative jobs.