Comment by zmmmmm 17 hours ago
It's important to note the way it was measured:
> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time
As I understand it, this means that if you prompt the model with actual text drawn from a specific subset comprising 42% of the book, it completes the prompt with the book's next 50 tokens at least 50% of the time.
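For concreteness, here is a rough sketch (in Python, using the Hugging Face transformers API) of how I picture that measurement working. The checkpoint name, the window sizes, and treating "50% of the time" as a 0.5 probability threshold are my assumptions from the quoted summary, not the paper's actual code:

```python
# Rough sketch of the measurement as I understand it (assumed details,
# not the paper's actual code): count a 50-token excerpt as "memorized"
# if the model assigns probability >= 0.5 to producing exactly that
# excerpt when prompted with the preceding text from the book.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-70B"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def continuation_prob(prefix_ids, target_ids):
    """P(target | prefix): product of the model's per-token probabilities."""
    ids = torch.tensor([prefix_ids + target_ids])
    logprobs = model(ids).logits[0].log_softmax(-1)
    total = 0.0
    for i, token_id in enumerate(target_ids):
        # logits at position p predict the token at position p + 1
        total += logprobs[len(prefix_ids) + i - 1, token_id].item()
    return math.exp(total)

def memorized(prefix_ids, target_ids, threshold=0.5):
    return continuation_prob(prefix_ids, target_ids) >= threshold
```

On that reading, "42 percent of the book" would be the fraction of 50-token windows for which something like `memorized()` returns True.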
So 50 tokens is not really very much; it's basically a sentence or two. Such a small amount would probably fall under fair use on its own. To allege a true copyright violation, you'd still need to show that you can chain those excerpts together, or use some other method, to build substantial portions of the book. And if the model only gets each one right 50% of the time, that seems like it would be very hard to do with high fidelity.
Having said all that, what is really interesting is how different the latest Llama 70B is from previous versions. It does suggest that Meta may have got a bit desperate and started over-training on certain materials, which greatly increased its direct-recall behaviour.
> So 50 tokens is not really very much; it's basically a sentence or two. Such a small amount would probably fall under fair use on its own.
That’s what I was thinking as I read the methodology.
If they dropped the same prompt fragment into Google (or any search engine), how often would they get the next 50 tokens' worth of text returned in the search-result summaries?