Comment by asciisnowman 17 hours ago

On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer's Stone.

It's sold 120 million copies over 30 years. I've gotta think literally every passage is quoted online somewhere else a bunch of times. You could probably stitch together the full book quote-by-quote.

davidcbc 16 hours ago

If I collect HP quotes from the internet and then stitch them together into a book, can I legally sell access to it?

bitmasher9 17 hours ago

Probably not?

Sure, there are only ~75,000 words in HP1, and there are probably many times that amount in direct quotes online. However, the quotes aren't evenly distributed across the text. For every quote of charming the snake at the zoo there will be a thousand "you're a wizard, Harry", and those are two prominent plot points.

I suspect the pages containing the least popular direct quotes from HP1 aren't quoting under fair use at all, and are just replicating large sections of the novel.

Or maybe it really is just so popular that super nerds have quoted the entire novel arguing about aspects of wand making, or the contents of every lecture.

tjpnz 15 hours ago

How many could do it from memory?

mvdtnz 17 hours ago

But also we know for a fact that Meta trained their models on pirated books. So there's no need to invent a harebrained scheme of stitching together bits and pieces like that.

    kouteiheika 14 hours ago

    No, assuming that just because something was in the training data it must have been memorized is harebrained.

    LLMs have a limited capacity to memorize, under ~4 bits per parameter[1][2], and are trained on terabytes of data. It's physically impossible for them to memorize everything they're trained on. The model memorized chunks of Harry Potter not simply because it was trained on the whole book, but because that text is duplicated so heavily across the training data, which the article also alludes to:

    > For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.

    In case it isn't obvious, both Harry Potter and Sandman Slim are part of the books3 dataset.
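
    To put rough numbers on that capacity point (a back-of-the-envelope sketch; the ~15T-token training-set size and ~4 bytes/token are my own ballpark assumptions, not figures from the article or from [1]):

      # Rough sketch: memorization capacity of a 70B model at ~4 bits/param
      # versus the size of its training text.
      params = 70e9                # Llama 3.1 70B parameters
      bits_per_param = 4           # upper bound reported in [1]
      capacity_gb = params * bits_per_param / 8 / 1e9
      print(f"Memorization capacity: ~{capacity_gb:.0f} GB")   # ~35 GB

      tokens = 15e12               # ~15T training tokens (assumption)
      bytes_per_token = 4          # rough average for English text (assumption)
      corpus_tb = tokens * bytes_per_token / 1e12
      print(f"Training text: ~{corpus_tb:.0f} TB")             # ~60 TB

    On those assumptions, even the upper bound on what the model can memorize is well under a tenth of a percent of the training text.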

    [1] https://arxiv.org/abs/2505.24832
    [2] https://arxiv.org/abs/2404.05405