Comment by gpm 18 hours ago

6 replies

I think it's important to recognize here that fanfiction.net has 850 thousand distinct pieces of Harry Potter fanfiction on it, fifty thousand of which are more than 40k words in length. Many of them (there is no easy way to measure how many) directly reproduce parts of the original books.

archiveofourown.org has 500 thousand; some, but probably not the majority, of those are duplicated from fanfiction.net. 37 thousand of these are over 40 thousand words.

I.e. Harry Potter and its derivatives presumably appear a million times in the training set, and it's hard to imagine a model that could discuss this cultural phenomenon well without knowing quite a bit about the source material.

aprilthird2021 17 hours ago

Did you read the article? This exact point is made and then analyzed.

> Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.

> “If it were citations and quotations, you'd expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

  • gpm 17 hours ago

    The article fails to mention or understand the volume of content here. Every, literally every, part of these books is quoted and "talked about" (in the sense of used in unlicensed derivative works).

    And yes, I read the article before commenting. I don't appreciate the baseless insinuation to the contrary.

    • 1123581321 17 hours ago

      Agreed. It’s an obtuse quote by Lemley who can’t picture the enormous quantity of associations and crawled data, or at least wants to minimize the quantity. It’s hardly discussion-ending.

      Accusations of not reading the article are fair when someone brings up a “related” anecdote that was in the article. It’s not fair when someone is just disagreeing.

    • davidcbc 17 hours ago

      Even assuming you are correct, which I'm skeptical of, does this make it better?

      It's essentially the same thing: they are copying from a source that is violating copyright, whether that's a pirated book directly or a pirated book via fanfiction.

      • gpm 17 hours ago

        Generally I think it matters a great deal to get the facts right when discussing something with nuance.

        Is this specific fact required to make my beliefs consistent? Yes, I think it is, but if you disagree with me in other ways it might not be important to your beliefs.

        Legally (note: not a lawyer) I'm generally of the opinion that

        A) Torrenting these books was probably copyright infringement on Meta's part. They should have done so legally by scanning lawfully acquired copies like Google did with Google Books.

        B) Everything else here that Meta did falls under the fair use and de minimis exceptions to copyright's prohibition on copying copyrighted works without a license.

        And if it were copying into the model significant amounts of a work that appeared only once in its training set, the de minimis argument would fall apart.

        Morally, I'm of the opinion that copyright law's prohibition on deeply interacting with our cultural artifacts by creating derivative works is incredibly unfair and bad for society. This extends to a belief that the communities that do this should not be excluded from technological developments because their entire existence is unjustly outlawed.

        Incidentally, I don't believe that browsing a site that complies with the DMCA and viewing what it lawfully serves you constitutes piracy, so I can't agree with your characterization of events either. The fanfiction was not pirated just because it was likely unlawful to produce in the US.
