Comment by mvdtnz
But also we know for a fact that Meta trained their models on pirated books. So there's no need to invent a harebrained scheme of stitching together bits and pieces like that.
No, we know it because it was established in court from Meta internal communications.
https://www.theguardian.com/technology/2025/jan/10/mark-zuck...
I'm confused. Nowhere in my post have I said that they didn't?
No, assuming that just because it was in the training data it must be memorized is harebrained.
LLMs have limited capacity to memorize, under ~4 bits per parameter[1][2], and are trained on terabytes of data. It's physically impossible for them to memorize everything they're trained on. The model memorized chunks of Harry Potter not simply because it was directly trained on the whole book, which the article also alludes to:
> For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.
In case it isn't obvious, both Harry Potter and Sandman Slim are part of the books3 dataset.
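A back-of-the-envelope check of the capacity argument above, assuming the ~4 bits/parameter upper bound from [1][2]; the corpus size here is a hypothetical round number for illustration, not Llama's actual training-set size:

```python
# Can a 70B-parameter model memorize its whole training set?
# Assumes the ~4 bits/parameter capacity estimate cited above [1][2].
params = 70e9                  # Llama 3.1 70B parameter count
bits_per_param = 4             # rough upper bound on memorization capacity

capacity_gb = params * bits_per_param / 8 / 1e9   # bits -> bytes -> GB

corpus_tb = 15                 # hypothetical multi-terabyte text corpus
corpus_gb = corpus_tb * 1e3

print(f"Max memorization capacity: {capacity_gb:.0f} GB")
print(f"Memorizable fraction of corpus: {capacity_gb / corpus_gb:.2%}")
```

Even at the generous 4-bit bound, a 70B model tops out around 35 GB of memorized content, a fraction of a percent of a multi-terabyte corpus; verbatim recall of every training document is ruled out by arithmetic alone.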
[1] https://arxiv.org/abs/2505.24832
[2] https://arxiv.org/abs/2404.05405