Comment by mvdtnz 18 hours ago

But we also know for a fact that Meta trained their models on pirated books. So there's no need to invent a harebrained scheme of stitching together bits and pieces like that.

kouteiheika 15 hours ago

No, what's harebrained is assuming that just because something was in the training data it must have been memorized.

LLMs have a limited capacity to memorize, under ~4 bits per parameter [1][2], and they are trained on terabytes of data, so it's physically impossible for them to memorize everything they're trained on (see the back-of-envelope numbers below). The model memorized chunks of Harry Potter not simply because it was directly trained on the whole book; merely appearing in the training data isn't enough, which the article also alludes to:

> For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.

In case it isn't obvious, both Harry Potter and Sandman Slim are part of the books3 dataset.
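
To make the capacity argument concrete, here's a rough back-of-envelope sketch. The ~3.6 bits/parameter capacity estimate comes from [1]; the ~15 trillion training tokens and ~4 bytes of raw text per token are my own rough assumptions for Llama-3-class models, not figures from the article:

```python
# Back-of-envelope: memorization capacity vs. training-set size.
# ASSUMPTIONS: ~3.6 bits/param capacity (estimate from [1]); ~15T training
# tokens and ~4 bytes of raw text per token are illustrative figures only.

BITS_PER_PARAM = 3.6
PARAMS = 70e9                    # Llama 3.1 70B

capacity_bytes = PARAMS * BITS_PER_PARAM / 8
print(f"memorization capacity: ~{capacity_bytes / 1e9:.0f} GB")        # ~32 GB

TRAIN_TOKENS = 15e12             # order of magnitude for Llama-3-class models
BYTES_PER_TOKEN = 4              # rough average for English text
train_bytes = TRAIN_TOKENS * BYTES_PER_TOKEN
print(f"training data seen:    ~{train_bytes / 1e12:.0f} TB")          # ~60 TB

# The model can only hold a tiny fraction of its training text verbatim.
print(f"verbatim fraction:     ~{capacity_bytes / train_bytes:.3%}")   # ~0.05%
```

Whatever the exact figures, the gap is orders of magnitude, so wholesale memorization of everything in the training set simply isn't on the table.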

[1] https://arxiv.org/abs/2505.24832
[2] https://arxiv.org/abs/2404.05405