Comment by Myrmornis
But lyrics are just one example. Are you saying that training experiments must filter out all substrings from the training input that bear too close a resemblance to a substring of a copyrighted work?
Obviously there's a limit: reproducing a single sentence is unlikely to be copyright infringement, simply because there are only so many words in a language. But if reproducing some text would be copyright infringement when a human did it, I don't see why LLM companies should get a free pass.
If it's really essential that they train their models on song lyrics, or books, or movie scripts, or articles, or whatever, they should pay license fees.