Comment by rpdillon
This is a point that I don't see discussed enough. I think anthropic decided to purchase books in bulk, tear them apart to scan them, and then destroy those copies. And that's the only source of copyrighted material I've ever heard of that is actually legal to use for training LLMs.
Most LLMs were trained on vast troves of pirated copyrighted material. Folks point this out, but they don't ever talk about what the alternative was. The content industries, like music, movies, and books, have done nothing to research or make their works available for analysis and innovation, and have in fact fought industries that seek to do so tooth and nail.
Further, they use the narrative that people that pirate works are stealing from the artists, where the vast majority of money that a customer pays for a piece of copyrighted content goes to the publishing industry. This is essentially the definition of rent seeking.
Those industries essentially tried to stop innovation entirely, and they tried to use the law to do that (and still do). So, other companies innovated over the copyright holder's objections, and now we have to sort it out in the courts.
> So, other companies innovated over the copyright holder's objections, and now we have to sort it out in the courts.
I think they try to expand copyright from "protected expression" to "protected patterns and abstractions", or in other words "infringement without substantial similarity". Otherwise why would they sue AI companies? It makes no sense:
1. If I wanted a specific author, I would get the original works, it is easy. Even if I am cheap it is still much easier to pirate than use generative models. In fact AI is the worst infringement tool ever invented - it almost never reproduces faithfully, it is slow and expensive to use. Much more expensive than copying which is free, instant and makes perfect replicas.
2. If I wanted AI, it means I did not want the original, I wanted something Else. So why sue people who don't want the originals? The only reason to use AI is when you want to steer the process to generate something personalized. It is not to replace the original authors, if that is what I needed no amount of AI would be able to compare to the originals. If you look carefully almost all AI outputs get published in closed chat rooms, with a small fraction being shared online, and even then not in the same venues as the original authors. So the market substitution logic is flimsy.