Comment by paxys 17 hours ago
As an experiment I searched Google for "harry potter and the sorcerer's stone text":
- the first result is a PDF of the full book
- the second result is a txt of the full book
- the third result is a PDF of the complete Harry Potter collection
- the fourth result is a txt of the full book (hosted on GitHub, funnily enough)
Further down there are similar copies from the Internet Archive and dozens of other sites, all within the first 2-3 pages of results.
I get that copyright is a problem, but let's not pretend that an LLM that autocompletes a couple of lines from Harry Potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.
> let's not pretend that an LLM that autocompletes a couple of lines from Harry Potter with 50% accuracy is some massive new avenue to piracy
No one is claiming this.
The corporations developing LLMs are doing so by sampling media without their owners' permission and arguing that this is protected by US fair use law, which is incorrect, as the late AI researcher Suchir Balaji explained in this other article:
https://suchir.net/fair_use.html