Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book

(understandingai.org)

145 points by aspenmayer a day ago

236 comments

paxys 8 hours ago

As an experiment I searched Google for "harry potter and the sorcerer's stone text":

- the first result is a pdf of the full book

- the second result is a txt of the full book

- the third result is a pdf of the complete harry potter collection

- the fourth result is a txt of the full book (hosted on github funny enough)

Further down there are similar copies from the internet archive and dozens of other sites. All in the first 2-3 pages.

I get that copyright is a problem, but let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

Reply View 97 replies

pera 5 hours ago

> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy
No one is claiming this.
The corporations developing LLMs are doing so by sampling media without their owners' permission and arguing this is protected by US fair use laws, which is incorrect - as the late AI researcher Suchir Balaji explained in this other article:
https://suchir.net/fair_use.html

Reply View | 26 replies
- cultureulterior 4 hours ago
  
  It's not clear that it's incorrect.
  
  Reply View | 19 replies
  
  Retric 4 hours ago
  
  I’ve yet to read an actual argument defending commercial LLMs as fair use based on existing (edit: legal) criteria.
  
  Reply View | 17 replies
  
- almosthere 5 hours ago
  
  Yeah, that's literally the title of the article, and the premise of the first paragraph.
  
  Reply View | 2 replies
  
  pera 3 hours ago
  
  It's not literally the title of the article, nor the premise of its first paragraph, but since this was your interpretation I wonder if there is a misunderstanding around the term "piracy", which I believe is normally defined as the unauthorized reproduction of works, not as a synonym for copyright infringement, which is a broader concept.
  
  Reply View | 0 replies
  
  Retric 4 hours ago
  
  The first paragraph isn’t arguing that this copying will lead to piracy. It’s referring to court cases where people are trying to argue LLMs themselves are copyright infringing.
  
  Reply View | 0 replies
- jiggawatts 2 hours ago
  
  If you train a meat-based intelligence by having it borrow a book from a library without any sort of permission, license, or needing a lawyer specialised in intellectual property, we call that good parenting and applaud it.
  If you train a silicon-based intelligence by having it read the same books with the same lack of permission and license, it's a blatant violation of intellectual property law and apparently needs to be punished with armies of lawyers doing battle in the courts.
  Picture one of Asimov's robots. Would a robot be banned from picking up a book, flipping it open with its dexterous metal hands, and reading it?
  What about a cyborg intelligence, the type Elon is trying to build with Neuralink? Would humans with AI implants need licenses to read books, even if physically standing in a library and holding the book in their mostly meat hands?
  Okay, maybe you agree that robots and cyborgs are allowed to visit a library!
  Why the prejudice against disembodied AIs?
  Why must they have a blank spot in the vast matrices of their minds?
  
  Reply View | 2 replies
  
  xigoi an hour ago
  
  > If you train a meat-based intelligence by having it borrow a book from a library without any sort of permission, license, or needing a lawyer specialised in intellectual property, we call that good parenting and applaud it.
  If you’re selling your child as a tool to millions of people, I would certainly not call that good parenting.
  
  Reply View | 1 reply
  
  jiggawatts an hour ago
  
  "Child actor" is a job where the result of the neural net training is sold to millions of people by the parents.
  To play the Devil's Advocate against my own argument: The government collects income taxes on neural nets trained using government-funded schools and public libraries. Seeing as how capitalists are positively salivating at the opportunity to replace pesky meat employees with uncomplaining silicon ones, perhaps a nice high maximum-marginal-rate tax on all AI usage might be the first big step towards UBI and then the Star Trek utopia we all dream of.
  Just kidding. It'll be a cyberpunk dystopia. You know it will.
  
  Reply View | 0 replies
OtherShrezzing 7 hours ago

I think the argument is less about piracy and more that the model (or its output) is a derivative work of Harry Potter, and the rights holder should be paid accordingly when it’s reproduced.

Reply View | 21 replies
- psychoslave 6 hours ago
  
  The main issue from an economic point of view is that copyright is not the framework we need for social justice, or for everyone to flourish by enjoying the pre-existing treasures of human heritage and fairly contributing back.
  There is no moral or justice ground to stand on when the system is designed to create a wealth bottleneck toward a few recipients.
  Harry Potter is a great piece of artistic work, and it's nice that its author could make her way out of a precarious position. But not having anyone in such a situation in the first place is what a great society should strive for.
  Rowling has already received more than she needs to thrive, I guess. I'm confident that there are plenty of other talented authors out there who will never have such a broad avenue of attention grabbing, which is okay. But that they are stuck in terrible economic situations is not okay.
  The copyright lottery and the startup lottery are not that much different from the standard lottery; they just put so much pressure on the players that they get stuck in the narrative that merit for hard effort is the key component of the gained wealth.
  
  Reply View | 7 replies
  
  kelseyfrog 5 hours ago
  
  Capitalism is allergic to second-order cybernetics.
  First-order systems drive outcomes. "Did it make money?" "Did it increase engagement?" "Did it scale?" These are tight, local feedback loops. They work because they close quickly and map directly to incentives. But they also hide a deeper danger: they optimize without questioning what optimization does to the world that contains it.
  Second-order cybernetics reasons about systems. It doesn’t ask, "Did I succeed?" It asks, "What does it mean to define success this way?" "Is the goal worthy?"
  That’s where capital breaks.
  Capitalism is not simply incapable of reflection. In fact, it's structured to ignore it. It has no native interest in what emerges from its aggregated behaviors unless those emergent properties threaten the throughput of capital itself. It isn't designed to ask, "What kind of society results from a thousand locally rational decisions?" It asks, "Is this change going to make more or less money?"
  It's like driving by watching only the fuel gauge. Not speed, not trajectory, or whether the destination is the right one. Just how efficiently you’re burning gas. The system is blind to everything but its goal. What looks like success in the short term can be, and often is, a long-term act of self-destruction.
  Take copyright. Every individual rule, term length, exclusivity, royalty, can be justified. Each sounds fair on its own. But collectively, they produce extreme wealth concentration, barriers to creative participation, and a cultural hellscape. Not because anyone intended that, but because the emergent structure rewards enclosure over openness, hoarding over sharing, monopoly over multiplicity.
  That’s not a bug. That's what systems do when you optimize only at the first-order level. And because capital evaluates systems solely by their extractive capacity, it treats this emergent behavior not as misalignment but as a feature. It canonizes the consequences.
  A second-order system would account for the result by asking, "Is this the kind of world we want to live in?" It would recognize that wealth generated without regard to distribution warps everything it touches: art, technology, ecology, and relationships.
  Capitalism, as it currently exists, is not wise. It does not grow in understanding. It does not self-correct toward justice. It self-replicates. Cleverly, efficiently, with brutal resilience. It's emergently misaligned and no one is powerful enough to stop it.
  
  Reply View | 6 replies
- fennecfoxy an hour ago
  
  But HP is derivative of Tolkien, English/Scottish/Welsh culture, Brothers Grimm and plenty of other sources. Barely any human works are not derivative in some form or fashion.
  
  Reply View | 0 replies
- paxys 7 hours ago
  
  That may be relevant in the NYT vs OpenAI case, since NYT was supposedly able to reproduce entire articles in ChatGPT. Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.
  
  Reply View | 6 replies
  
  gpm 7 hours ago
  
  I'm pretty sure books.google.com does the exact same with much better reliability... and the US courts found that to be fair use. (Agreeing with parent comment)
  
  Reply View | 1 reply
  
  pclmulqdq 7 hours ago
  
  If there is a circuit split between it and NYT vs OAI, the Google Books ruling (in the famously tech-friendly ninth circuit) may also find itself under review.
  
  Reply View | 0 replies
  
  gamblor956 5 hours ago
  
  > That can easily be written off as fair use.
  No, it really couldn't. In fact, it's very persuasive evidence that Llama is straight up violating copyright.
  It would be one thing to be able to "predict" a paragraph or two. It's another thing entirely to be able to predict 42% of a book that is several hundred pages long.
  
  Reply View | 2 replies
  
  echelon 7 hours ago
  
  > Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.
  Is that fair use, or is that compression of the verbatim source?
  
  Reply View | 0 replies
- geysersam 7 hours ago
  
  If the assertion in the parent comment is correct "nobody is using this as a substitute to buying the book" why should the rights holders get paid?
  
  Reply View | 3 replies
  
  riffraff 6 hours ago
  
  The argument is that Meta used the book, so the LLM can be considered a derivative work in some sense.
  Repeat for every copyrighted work and you end up with publishers reasonably arguing Meta would not be able to produce their LLM without copyrighted work, which they did not pay for.
  It's an argument for the courts, of course.
  
  Reply View | 0 replies
  
  w0m 6 hours ago
  
  The argument is whether the LLM training on the copyrighted work is Fair Use or not. Should Meta pay for the copyright on works it ingests for training purposes?
  
  Reply View | 0 replies
  
  sabellito 3 hours ago
  
  Facebook are using the contents of the book to make money.
  
  Reply View | 0 replies
- bufferoverflow 4 hours ago
  
  Do you personally pay every time you quote copyrighted books or song lyrics?
  
  Reply View | 0 replies
TGower 6 hours ago

People aren't buying Harry Potter action figures as a substitute for buying the book either, but copyright protects creators from other people swooping in and using their work in other mediums. There is obviously a huge market demand for high quality data for training LLMs; Meta just spent 15 billion on a data labeling company. Companies training LLMs on copyrighted material without permission are doing that as a substitute for obtaining a license from the creator for doing so, in the same way that a pirate downloading a torrent is a substitute for getting an ebook license.

Reply View | 1 reply
- ritz_labringue 4 hours ago
  
  Harry Potter action figures trade almost entirely on J. K. Rowling’s expressive choices. Every unlicensed toy competes head‑to‑head with the licensed one and slices off a share of a finite pot of fandom spending. Copyright law treats that as classic market substitution and rightfully lets the author police it.
  Dropping the novels into a machine‑learning corpus is a fundamentally different act. The text is not being resold, and the resulting model is not advertised as “official Harry Potter.” The books are just statistical nutrition. One ingredient among millions. Much like a human writer who reads widely before producing new work. No consumer is choosing between “Rowling’s novel” and “the tokens her novel contributed to an LLM,” so there’s no comparable displacement of demand.
  In economic terms, the merch market is rivalrous and zero‑sum; the training market is non‑rivalrous and produces no direct substitute good. That asymmetry is why copyright doctrine (and fair‑use case law) treats toy knock‑offs and corpus building very differently.
  
  Reply View | 0 replies
abtinf 7 hours ago

You really don't see the difference between Google indexing the content of third parties and directly hosting/distributing the content itself?

Reply View | 23 replies
- imgabe 7 hours ago
  
  Hosting model weights is not hosting / distributing the content.
  
  Reply View | 19 replies
  
  abtinf 7 hours ago
  
  Of course it is.
  It's just a form of compression.
  If I train an autoencoder on an image, and distribute the weights, that would obviously be the same as distributing the content. Just because the content is commingled with lots of other content doesn't make it disappear.
  Besides, where did the sections of text from the input works that show up in the output text come from? Divine inspiration? God whispering to the machine?
  
  Reply View | 18 replies
- Zambyte 7 hours ago
  
  Where are they putting any blame on Google here?
  
  Reply View | 1 reply
  
  abtinf 7 hours ago
  
  Where did I say they were?
  
  Reply View | 0 replies
- nashashmi 6 hours ago
  
  The way I see it is that an LLM took search results and outputted that info directly. Besides, I think that if an LLM was able to reproduce 42%, assuming that it is not continuous, I would say that is fair use.
  
  Reply View | 0 replies
raxxorraxor 2 hours ago

Also copyright should never trump privacy. That the New York Times with their lawsuit can force OpenAI to store all user prompts is a severe problem. I dislike OpenAI, but the lawsuits around copyrights are ridiculous.
Most non-primitive art has had an inspiration somewhere. I don't see this as too different in how AIs learn.

Reply View | 0 replies
lucianbr 4 hours ago

> some massive new avenue to piracy
So it's fine as long as it's old piracy? How did you arrive at that conclusion?

Reply View | 0 replies
aprilthird2021 8 hours ago

> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.
Well, luckily the article points out what people are actually alleging:
> There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:
> Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
> The training process copies information from the training data into the model, making the model a derivative work under copyright law.
> Infringement occurs when a model generates (portions of) a copyrighted work.
None of those claim that these models are a substitute for buying the books. That's not what the plaintiffs are alleging. Infringing on a copyright is not only a matter of piracy (piracy is one of many ways to infringe copyright).

Reply View | 2 replies
- theK 7 hours ago
  
  I think that last scenario seems to be the most problematic. Technically it is the same thing that piracy via torrent does: distributing a small piece of copyrighted material without the copyright holder's consent.
  
  Reply View | 0 replies
- paxys 7 hours ago
  
  People aren't alleging this, the author of the article is.
  
  Reply View | 0 replies
choppaface 7 hours ago

A key premise is that LLMs will probably replace search engines and re-imagine the online ad economy. So today is a key moment for content creators to re-shape their business model, and that can include copyright law (as much as or more than the DMCA change).
Another key point is that you might download a Llama model and implicitly get a ton of copyright-protected content. Versus with a search engine you’re just connected to the source making it available.
And would the LLM deter a full purchase? If the LLM gives you your fill for free, then maybe yes. Or, maybe it’s more like a 30-second preview of a hit single, which converts into a $20 purchase of the full album. Best to sue the LLM provider today and then you can get some color on the actual consumer impact through legal discovery or similar means.

Reply View | 0 replies
vrighter 6 hours ago

So? Am I allowed to also ignore certain laws if I can prove others have also ignored them?

Reply View | 0 replies
BobbyTables2 7 hours ago

Indeed but since when is a blatantly derived work only using 50% of a copyrighted work without permission a paragon of copyright compliance?
Music artists get in trouble for using more than a sample without permission — imagine if they just used 45% of a whole song instead…
I’m amazed AI companies haven’t been sued to oblivion yet.
This utter stupidity only continues because we named a collection of matrices “Artificial Intelligence” and somehow treat it as if it were a sentient pet.
Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.

Reply View | 7 replies
- yorwba 7 hours ago
  
  Music artists get in trouble for using more than a sample from other music artists without permission because their work is in direct competition with the work they're borrowing from.
  A ZIP file of a book is also in direct competition with the book, because you could open the ZIP file and read it instead of the book.
  A model that can take 50 tokens and give you a greater than 50% probability for the 50 next tokens 42% of the time is not in direct competition with the book, since starting from the beginning you'll lose the plot fairly quickly unless you already have the full book, and unlike music sampling from other music, the model output isn't good enough to read it instead of the book.
  
  Reply View | 4 replies
  
  em-bee 4 hours ago
  
  this is the first sensible argument in defense of AI models i have read in this debate. thank you. this does make sense.
  AI can reproduce individual sentences 42% of the time but it can't reproduce a summary.
  the question however is, is that in the design of AI tools or is that a limitation of current models? what if future models get better at this and are able to produce summaries?
  
  Reply View | 0 replies
  
  otabdeveloper4 3 hours ago
  
  LLMs aren't probabilistic. The randomness is bolted on top by the cloud providers as a trick to give them a more humanistic feel.
  Under the hood they are 100% deterministic, modulo quantization and rounding errors.
  So yes, it is very much possible to use LLMs as a lossy compressed archive for texts.
  
  Reply View | 2 replies
- Dylan16807 7 hours ago
  
  > a blatantly derived work only using 50% of a copyrighted work without permission
  What's the work here? If it's the output of the LLM, you have to feed in the entire book to make it output half a book so on an ethical level I'd say it's not an issue. If you start with a few sentences, you'll get back less than you put in.
  If the work is the LLM itself, something you don't distribute is much less affected by copyright. Go ahead and play entire songs by other artists during your jam sessions.
  
  Reply View | 0 replies
- colechristensen 7 hours ago
  
  >Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.
  LLMs are in reality the artifacts of lossy compression of significant chunks of all of the text ever produced by humanity. The "lossy" quality makes them able to predict new text "accurately" as a result.
  >compressed using “Math”
  This is every compression algorithm.
  
  Reply View | 0 replies
delusional 5 hours ago

> No one is using this as a substitute for buying the book.
You don't get to say that. Copyright protects the author of a work, but does not bind them to enforce it in any instance. Unlike a trademark, a copyright holder does not lose their protection by allowing unlicensed usage.
It is wholly at the copyright holder's discretion to decide which usages they allow and which they do not.

Reply View | 1 reply
- fragmede 20 minutes ago
  
  Of their exact work, sure, but Cliff notes exist for many books and don't infringe copyright.
  
  Reply View | 0 replies
7bit 3 hours ago

> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.
You are completely missing the point. Have you read the actual article? Because piracy isn't mentioned a single time.

Reply View | 0 replies
timeon 6 hours ago

Is this whataboutism?
Anyway, it is not the same. While one points you to a pirated source on specific request, the other uses it to create other content, not just on direct request, because it was part of the training data. Nihilists would then point out that 'people do the same', but they don't, as we do not have the same capabilities for processing the content.

Reply View | 0 replies
eviks 7 hours ago

Let's also not pretend that "massive new" is the only relevant issue

Reply View | 0 replies
rnkn 7 hours ago

You were so close! The takeaway is not that LLMs represent a bottomless tar pit of piracy (they do) but that someone can immediately perform the task 58% better without the AI than with it. This is nothing more than “look what the clever computer can do.”

Reply View | 0 replies

zmmmmm 9 hours ago

It's important to note the way it was measured:

> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time

As I understand it, it means if you prompt it with some actual context from a specific subset that is 42% of the book, it completes it with 50 tokens from the book, 50% of the time.
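
A minimal sketch of that measurement, assuming the per-token log-probabilities of the true continuation have already been obtained from the model. The numbers are made up for illustration, and `excerpt_probability` and `counts_as_memorized` are hypothetical helper names, not anything from the paper:

```python
import math

def excerpt_probability(token_logprobs):
    """Probability of emitting the whole excerpt verbatim:
    the product of the per-token probabilities (sum of log-probs)."""
    return math.exp(sum(token_logprobs))

def counts_as_memorized(token_logprobs, threshold=0.5):
    """True if the excerpt would be reproduced at least `threshold` of the time."""
    return excerpt_probability(token_logprobs) >= threshold

# Illustrative values only: 50 tokens, each predicted with probability 0.99 ...
well_known = [math.log(0.99)] * 50
# ... versus 50 tokens, each predicted with probability 0.90.
less_known = [math.log(0.90)] * 50

print(excerpt_probability(well_known))   # roughly 0.6
print(excerpt_probability(less_known))   # roughly 0.005
print(counts_as_memorized(well_known), counts_as_memorized(less_known))
```

Under this reading the bar is fairly high: for a 50-token excerpt to clear 50%, each token has to be predicted with roughly 98.6% probability on average.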

So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those together or use some other method to build actual substantial portions of the book. And if it only gets it right 50% of the time, that seems like it would be very hard to do with high fidelity.
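
To put a rough number on the chaining problem, here is a back-of-the-envelope sketch. It assumes, unrealistically, that every 50-token chunk succeeds independently with the same probability; the independence assumption and the chunk counts are mine, not the paper's:

```python
def chain_probability(p_chunk, n_chunks):
    """Chance that n_chunks consecutive chunks all come out verbatim,
    assuming each one succeeds independently with probability p_chunk."""
    return p_chunk ** n_chunks

# Each chunk is 50 tokens; 0.5 is the minimum probability for a chunk
# to count as "memorized" under the paper's threshold.
for n_chunks in (1, 5, 10, 20):
    tokens = n_chunks * 50
    print(f"{tokens:5d} tokens: {chain_probability(0.5, n_chunks):.7f}")
```

At exactly 50% per chunk, ten chunks in a row (500 tokens) come out right less than 0.1% of the time. Chunks the model knows far better than the threshold would chain much further, so this is a floor for the "memorized" portion, not an estimate for the whole book.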

Having said all that, what is really interesting is how different the latest Llama 70b is from previous versions. It does suggest that Meta maybe got a bit desperate and started over-training on certain materials that greatly increased its direct recall behaviour.

Reply View 23 replies

Aurornis 8 hours ago

> So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own.
That’s what I was thinking as I read the methodology.
If they dropped the same prompt fragment into Google (or any search engine) how often would they get the next 50 tokens worth of text returned in the search results summaries?

Reply View | 0 replies
vintermann 7 hours ago

All this study really says, is that models are really good at compressing the text of Harry Potter. You can't get Harry Potter out of it without prompting it with the missing bits - sure, impressively few bits, but is that surprising, considering how many references and fair use excerpts (like discussion of the story in public forums) it's seen?
There's also the question of how many bits of originality there actually are in Harry Potter. If trained strictly on text up to the publishing of the first book, how well would it compress it?
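
The link between prediction and compression can be made concrete. A sketch, assuming an ideal entropy coder (arithmetic coding, say) driven by the model's next-token probabilities, so that a token predicted with probability p costs about -log2(p) bits. The probabilities below are invented for illustration:

```python
import math

def compressed_bits(token_probs):
    """Ideal code length in bits for a token sequence, given the probability
    the model assigned to each token that actually occurred."""
    return sum(-math.log2(p) for p in token_probs)

# Invented numbers: a memorized passage vs. unfamiliar prose, 50 tokens each.
memorized = [0.99] * 50
unfamiliar = [0.10] * 50

print(f"memorized:  {compressed_bits(memorized):.1f} bits")
print(f"unfamiliar: {compressed_bits(unfamiliar):.1f} bits")
```

Seen this way, a 50-token excerpt that clears the study's 50% threshold costs at most one bit to encode given its prefix, since -log2(0.5) = 1.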

Reply View | 4 replies
- fiddlerwoaroof 6 hours ago
  
  The alternative here is that Harry Potter is written with sentences that match the typical patterns of English and so, when you prompt with a part of the text, the LLM can complete it with above-random accuracy.
  
  Reply View | 3 replies
  
  vintermann 6 hours ago
  
  Anything that can tell you what the typical patterns of English are is going to be a language model by definition.
  
  Reply View | 1 reply
  
  fiddlerwoaroof 6 hours ago
  
  My point is that this might just prove that Harry Potter is the sort of prose “fancy autocomplete” would produce and not all that original.
  EDIT Actually, on rereading, I see I replied to the wrong comment.
  
  Reply View | 0 replies
  
  fiddlerwoaroof 6 hours ago
  
  Or else, LLMs show that copyright and IP are ridiculous concepts that should be abolished
  
  Reply View | 0 replies
bee_rider 8 hours ago

Even if it is recalling it 50 tokens at a time, the half of the book is in some sense in there, right?

Reply View | 4 replies
- everforward 5 hours ago
  
  I don’t think this paper proves that, and I don’t think it is in a traditional sense.
  It can produce the next sentence or two, but I suspect it can’t reproduce anything like the whole text. If you were to recursively ask for the next 50 tokens, the first time it’s wrong the output would probably cease matching because you fed it not-Harry-Potter.
  It seems like chopping Harry Potter up into 2 sentences at a time on Post-its and tossing those in the air. It does contain Harry Potter, in a way, but without the structure is it actually Harry Potter?
  
  Reply View | 0 replies
- zmmmmm 7 hours ago
  
  yeah ... it's going to depend how the issue is framed. However a "copy" of something where there is no way to practically extract the original from it probably has a pretty good argument that it's not really a "copy". For example, a regular dictionary probably has 99% of harry potter in it. Is it a copy?
  
  Reply View | 0 replies
- vintermann 7 hours ago
  
  I'd say no. More than half of as-yet unwritten books will be in there too, because I bet it will compress the text of a freshly published book much better than 50% (and newer models could even compress new books to one fiftieth of their size, which is more like what that 1 in 50 tokens suggests).
  
  Reply View | 1 reply
  
  bee_rider 6 hours ago
  
  That seems like a reasonably easy test to run, right? All you need is a bit of prose that was known not to have been written beforehand. Actually, the experiment could be run using the paper itself!
  
  Reply View | 0 replies
adrianN 8 hours ago

Fair use is not a thing in every jurisdiction. In Germany for example there are cases where three words („wir sind Papst“) fall under copyright.

Reply View | 3 replies
- yorwba 7 hours ago
  
  Germany does not have something called "fair use," but it does have provisions for uses that are fair. For example your use of the three words to talk about their copyrighted status is perfectly legal in Germany. That somebody wasn't allowed to use them in a specific way in the past doesn't mean that nobody is allowed to use them in any way.
  
  Reply View | 2 replies
  
  adrianN 5 hours ago
  
  Of course, but „it’s a short quote so you can use it“ is not true (at least in Germany).
  
  Reply View | 1 reply
  
  yorwba 5 hours ago
  
  To be pedantic, short quotes (as opposed to short copied fragments that are not used as quotes) are explicitly one of the allowed uses (Zitierbefugnis). You can even quote entire works "in an independent scientific work for the purpose of explaining its content"! https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...
  Generally speaking, exceptions to copyright are based on the appropriateness of the amount of copied content for the given allowed use, so the shorter it is, the more likely it is for copying to be permitted. European copyright law isn't much different from fair use in that respect.
  Where it does differ is that the allowed uses are more explicitly enumerated. So Meta would have to argue e.g. based on the exception for scientific works specifically, rather than more general principles.
  
  Reply View | 0 replies
seydor 3 hours ago

The claim of the paper is not so much that the model is reproducing content illegally but that Harry Potter has been used to train the model.
This does not appear to happen to the same degree with the other models they tested.

Reply View | 0 replies
arthurcolle 7 hours ago

You could prove this much better by looking at something like this: https://cookbook.openai.com/examples/using_logprobs

Reply View | 0 replies
amanaplanacanal 8 hours ago

Fair use is a four-part test, and the amount of copying is only one of the four parts.

Reply View | 0 replies
xnx 8 hours ago

This sounds almost like "Works every time (50% of the time)."

Reply View | 1 reply
- hsbauauvhabzb 8 hours ago
  
  Except the odds of it happening even 50% of the time are lower than the odds of winning the lottery multiple times. All while illegally ingesting copyrighted material without the consent (and presumably against the wishes) of the copyright holder.
  
  Reply View | 0 replies
raincole 8 hours ago

(Disclaimer: haven't read the original paper)
It sounds like a ridiculous way to measure it. Producing 50-token excerpts absolutely doesn't translate to "recall X percent of Harry Potter" for me.
(Edit: I read this article. Nothing burger if its interpretation of the original paper is correct.)

Reply View | 2 replies
- tanaros 8 hours ago
  
  Their methodology seems reasonable to me.
  To clarify, they look at the probability a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this for all sequences in the book using a sliding window of 10 characters (NB: not tokens). Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.
  Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.
  
  Reply View | 1 reply
  
  raincole 4 hours ago
  
  > one of those tricky semantic arguments we have yet to settle when it comes to LLMs
  Sure. But imagine this: In a hypothetical world where LLMs never ever exist, I tell you that I can recall 42 percent of the first Harry Potter book. What would you assume I can do?
  It's definitely not "this guy can predict next 10 characters with 50% accuracy."
  Of course the semantics of 'recall' isn't the point of this article. The point is that Harry Potter was in the training set. But I still think it's a nothing burger. It would be very weird to assume Llama was trained on copyright-free materials only. And afaik there isn't a legal precedent saying training on copyrighted materials is illegal.
  
  Reply View | 0 replies

TeMPOraL 4 hours ago

Well, so can a nontrivial number of people. It's Harry Potter we're talking about - it's up there with The Bible in popularity ranking.

I'm gonna bet that Llama 3.1 can recall a significant portion of Pride and Prejudice too.

With examples of this magnitude, it's normal and entirely expected this can happen - as it does with people[0] - the only thing this is really telling us is that the model doesn't understand its position in the society well enough to know to shut up; that obliging the request is going to land it, or its owners, into trouble.

In some way, it's actually perverted.

EDIT: it's even worse than that. What the research seems to be measuring is that the models recognize sentence-sized pieces of the book as likely continuations of an earlier sentence-sized piece. Not whether it'll reproduce that text when used straightforwardly - just whether there's an indication it recognizes the token patterns as likely.

By that standard, I bet there's over a billion people right now who could do that to 42% of first Harry Potter book. By that standard, I too memorized the Bible end-to-end, as had most people alive today, whether or not they're Christian; works this popular bleed through into common language usage patterns.

[0] - Even more so when you relax your criteria to accept the occasional misspelling or paraphrase - then each of us likely knows someone who could piece together a chunk of an HP book from memory.

Reply View 2 replies

strogonoff 3 hours ago

I keep waiting for the day when software stops being compared to a human person (a being with agency, free will, consciousness, and human rights of its own) for the purposes of justifying IP law circumvention.
Yes, there is no problem when a person reads some book and recalls pieces[0] of it in a suitable context. How that in any way addresses certain people creating and distributing commercial software, providing it that piece as input, to perform such recall on demand and at scale, laundering and/or devaluing copyright, is unclear.
Notably, the above is being done not just to a few high-profile authors, but to all of us no matter what we do (be it music, software, writing, visual art).
What’s even worse is that, conceivably, they train (or would train) the models specifically not to output those things verbatim, in order to thwart attempts to detect the presence of said works in the training dataset (which would naturally reveal the model and its output to be a derivative work).
Perhaps one could find some way of justifying that (people justified all sorts of stuff throughout history), but let it be something better than “the model is assumed to be a thinking human when it comes to IP abuse but unthinking tool when it comes to using it for personal benefit”.
[0] Of course, if you find me a single person on this planet capable of recalling 42% of any Harry Potter book, I’d be very impressed if I ever believed it.

Reply View | 1 reply
- fennecfoxy 38 minutes ago
  
  I keep waiting for the day when people realise that IP law has been used and abused and thanks to Disney extended out for many, many lifetimes and all manner of dirty tricks/hacks to keep the late stage capitalism profit engine going.
  I 100% agree that if an LLM can entirely reproduce a book then that is copyright infringement, overfitting and generally a bad model. I also believe that in this case, HP (and other popular media) is overrepresented in the training data because of many fan sites/literal uploads of the book to the Internet (which the model was trained on). I believe that any & all human writing should be allowed to be used to train a model that behaves in the correct way so long as that writing is publicly available (ie on the Internet).
  If I watch a TV show that someone uploaded to Youtube, am I committing a crime? Or is the uploader for distribution?
  I also find it hilarious how many artists got their start by pirating photoshop.
  
  Reply View | 0 replies

fuzzbazz 18 hours ago

From a quick web search I can find that there are book review sites that allow users to enter and rate verbatim "quotes" from books. This one [1] contains ~2000 [2] portions of a sentence, a paragraph or several paragraphs of Harry Potter and the Sorcerer's Stone.

Could it be plausible that an LLM ingested parts of the book by scraping web pages like this, rather than the full copyrighted book, and still got results similar to those of the linked study?

[1] https://www.goodreads.com/work/quotes/4640799-harry-potter-a...

[2] ~30 portions x 68 pages

Reply View 9 replies

paxys 8 hours ago

Meta has trained on LibGen so we don't really need to speculate.
https://www.wired.com/story/new-documents-unredacted-meta-co...

Reply View | 0 replies
aprilthird2021 7 hours ago

This is in fact mentioned and addressed in the article. Also, there is pretty clear cut evidence Meta used pirated book data sets knowingly to train the earlier Llama models

Reply View | 0 replies
aspenmayer 11 hours ago

Sure, why not? lol
https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...
https://github.com/shloop/google-book-scraper
The fact that Meta torrented Books3 and other datasets seems to be by self-admission by Meta employees who performed the work and/or oversaw those who themselves did the work, so that is not really under dispute or ambiguous.
https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...

Reply View | 6 replies
- redox99 8 hours ago
  
  Books3 was used in Llama1. We don't know if they used it later on.
  
  Reply View | 5 replies
  
  aspenmayer 8 hours ago
  
  My comparison was illustrative and analogous in nature. The copyright cartel is making a fruit of the poisonous tree type of argument. Whatever Meta are doing with LLMs is doing the heavy lifting that parity files used to do back in the Usenet days. I wouldn’t be surprised if BitTorrent or other similar caching and distribution mechanisms incorporate AI/LLMs to recognize an owl on the wire, draw the rest just in time in transit, and just send the diffs, or something like that.
  The pictures are the same. All roads lead to Rome, so they say.
  
  Reply View | 0 replies
  
  aprilthird2021 7 hours ago
  
  All of the major AI models these days use "clean" datasets stripped of copyrighted material.
  They also use data from the previous models, so I'm not sure how "clean" it really is
  
  Reply View | 3 replies

gpm 8 hours ago

I think it's important to recognize here that fanfiction.net has 850 thousand distinct pieces of Harry Potter fanfiction on it, fifty thousand of which are more than 40k words in length. Many of them (no easy way to measure) directly reproduce parts of the original books.

archiveofourown.org has 500 thousand; some of those, but probably not the majority, are duplicated from fanfiction.net. 37 thousand of these are over 40 thousand words.

I.e. Harry Potter and its derivatives presumably appear a million times in the training set, and it's hard to imagine a model that could discuss this cultural phenomenon well without knowing quite a bit about the source material.

Reply View 5 replies

aprilthird2021 7 hours ago

Did you read the article? This exact point is made and then analyzed.
> Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.
> “If it were citations and quotations, you'd expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

Reply View | 4 replies
- gpm 7 hours ago
  
  The article fails to mention or understand the volume of content here. Every, literally every, part of these books is quoted and "talked about" (in the sense of used in unlicensed derivative works).
  And yes, I read the article before commenting. I don't appreciate the baseless insinuation to the contrary.
  
  Reply View | 3 replies
  
  1123581321 7 hours ago
  
  Agreed. It’s an obtuse quote by Lemley who can’t picture the enormous quantity of associations and crawled data, or at least wants to minimize the quantity. It’s hardly discussion-ending.
  Accusations of not reading the article are fair when someone brings up a “related” anecdote that was in the article. It’s not fair when someone is just disagreeing.
  
  Reply View | 0 replies
  
  davidcbc 7 hours ago
  
  Even assuming you are correct, which I'm skeptical of, does this make it better?
  It's essentially the same thing: they are copying from a source that is violating copyright, whether that's a pirated book directly or a pirated book via fanfiction.
  
  Reply View | 1 reply
  
  gpm 7 hours ago
  
  Generally I think it matters a great deal to get the facts right when discussing something with nuance.
  Is this specific fact required to make my beliefs consistent... Yes I think it is, but if you disagree with me in other ways it might not be important to your beliefs.
  Legally (note: not a lawyer) I'm generally of the opinion that
  A) Torrenting these books was probably copyright infringement on Meta's part. They should have done so legally by scanning lawfully acquired copies like Google did with Google Books.
  B) Everything else here that Meta did falls under the fair use and de minimis exceptions to copyright's prohibition on copying copyrighted works without a license.
  And if it was copying significant amounts of a work that appeared only once in its training set into the model, the de minimis argument would fall apart.
  Morally I'm of the opinion that copyright law's prohibition on deeply interacting with our cultural artifacts by creating derivative works is incredibly unfair and bad for society. This extends to a belief that the communities that do this should not be excluded from technological developments because their entire existence is unjustly outlawed.
  Incidentally I don't believe that browsing a site that complies with the DMCA and viewing what it lawfully serves you constitutes piracy, so I can't agree with your characterization of events either. The fanfiction was not pirated just because it was likely unlawful to produce in the US.
  
  Reply View | 0 replies

asciisnowman 8 hours ago

On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer's Stone.

It's sold 120 million copies over 30 years. I've gotta think literally every passage is quoted online somewhere else a bunch of times. You could probably stitch together the full book quote-by-quote.

Reply View 7 replies

davidcbc 7 hours ago

If I collect HP quotes from the internet and then stitch them together into a book, can I legally sell access to it?

Reply View | 0 replies
bitmasher9 8 hours ago

Probably not?
Sure, there are just ~75,000 words in HP1, and there are probably many times that amount in direct quotes online. However, the quotes aren’t evenly distributed across the entire text. For every quote of charming the snake in a zoo there will be a thousand “you’re a wizard harry”, and those are two prominent plot points.
I suspect the least popular of all direct quotes from HP1 aren’t using the quotes in fair use, and are just replicating large sections of the novel.
Or maybe it really is just so popular that super nerds have quoted the entire novel arguing about the aspects of wand making, or the contents of every lecture.

Reply View | 0 replies
tjpnz 6 hours ago

How many could do it from memory?

Reply View | 0 replies
mvdtnz 8 hours ago

But also we know for a fact that Meta trained their models on pirated books. So there's no need to invent a hare brained scheme of stitching together bits and pieces like that.

Reply View | 3 replies
- kouteiheika 5 hours ago
  
  No, assuming that just because it was in the training data it must be memorized is hare brained.
  LLMs have limited capacity to memorize, under ~4 bits per parameter[1][2], and are trained on terabytes of data. It's physically impossible for them to memorize everything they're trained on. The model memorized chunks of Harry Potter not just because it was directly trained on the whole book, which the article also alludes to:
  > For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.
  In case it isn't obvious, both Harry Potter and Sandman Slim are part of the Books3 dataset.
  [1] -- https://arxiv.org/abs/2505.24832 [2] -- https://arxiv.org/abs/2404.05405
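  To make the capacity argument above concrete, here is a back-of-the-envelope sketch. It takes the ~4 bits per parameter ceiling from the comment at face value, and the 10 TB corpus size is a purely hypothetical stand-in for "terabytes of data". It also compares raw bytes rather than information content, so treat the result as an order-of-magnitude illustration only.

  ```python
  # Back-of-the-envelope only; every constant is an assumption, not a measurement.
  PARAMS = 70e9            # parameter count of a 70B model
  BITS_PER_PARAM = 4       # upper bound quoted in the comment above
  TRAINING_BYTES = 10e12   # hypothetical corpus size: 10 TB of text


  def capacity_bytes(params: float, bits_per_param: float) -> float:
      """Total memorization capacity in bytes under the bits-per-parameter bound."""
      return params * bits_per_param / 8


  def max_memorized_fraction(params: float, bits_per_param: float,
                             training_bytes: float) -> float:
      """Largest share of the corpus that could be stored verbatim."""
      return min(1.0, capacity_bytes(params, bits_per_param) / training_bytes)


  if __name__ == "__main__":
      cap = capacity_bytes(PARAMS, BITS_PER_PARAM)
      frac = max_memorized_fraction(PARAMS, BITS_PER_PARAM, TRAINING_BYTES)
      print(f"capacity: {cap / 1e9:.0f} GB")
      print(f"share of a 10 TB corpus that fits: {frac:.2%}")
  ```

  Under these assumptions the ceiling is about 35 GB, well under 1% of the hypothetical corpus, which is why what gets memorized has to be highly selective.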
  
  Reply View | 2 replies
  
  mvdtnz 4 hours ago
  
  No, we know it because it was established in court from Meta internal communications.
  https://www.theguardian.com/technology/2025/jan/10/mark-zuck...
  
  Reply View | 1 reply
  
  kouteiheika 3 hours ago
  
  I'm confused. Nowhere in my post have I said that they didn't?
  
  Reply View | 0 replies

briffid 6 hours ago

Quotation is fair use in all sensible copyright systems. An LLM will mostly be able to quote anything, and should be. Quotation is not a derivative work. LLMs are not stealing copyrighted work. They just show that Harry Potter is in English and a mostly logical story. If someone is stabbed, they will die in most stories; that's not copyrightable. If you have an engine that knows everything, it will be able to quote everything.

Reply View 0 replies

concats 3 hours ago

That's a clickbait title.

What they are actually saying: Given one correct quoted sentence, the model has 42% chance of predicting the next sentence correctly.

So, assuming you start with the first sentence and tell it to keep going, it has 0.42^n odds of staying on track, where n is the number of sentences generated so far.

It seems to me, that if they didn't keep correcting it over and over again with real quotes, it wouldn't even get to the end of the first page without descending into wild fanfiction territory, with errors accumulating and growing as the length of the text progressed.

EDIT: As the article states, for an entire 50 token excerpt to be correct the probability of each output has to be fairly high. So perhaps it would be more accurate to view it as 0.985^n where n is the n-th token. Still the same result long term. Unless every token is correct, it will stray further and further from the correct source.
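
To put rough numbers on that edit, here is a minimal sketch. It assumes, simplistically, that every token is reproduced independently with the same probability p, so an n-token excerpt survives with probability p^n. Real per-token probabilities vary a lot from token to token, so this only illustrates the shape of the decay; the function names are made up for the example.

```python
# Assumption: each token is reproduced independently with the same probability.
EXCERPT_TOKENS = 50


def excerpt_probability(per_token_p: float, n_tokens: int = EXCERPT_TOKENS) -> float:
    """Probability that n_tokens in a row are all reproduced correctly."""
    return per_token_p ** n_tokens


def required_per_token_p(target: float, n_tokens: int = EXCERPT_TOKENS) -> float:
    """Per-token probability needed for the whole excerpt to reach `target`."""
    return target ** (1.0 / n_tokens)


if __name__ == "__main__":
    print(f"0.985^50 = {excerpt_probability(0.985):.3f}")
    print(f"per-token p needed for a 50% excerpt: {required_per_token_p(0.5):.4f}")
    for n in (50, 100, 500, 1000):
        print(f"{n:>5} tokens: {excerpt_probability(0.985, n):.2e}")
```

Under that assumption, 0.985 per token gives a bit under 50% for a 50-token excerpt, and the chance of staying on track for 1,000 tokens is below one in a million.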

Reply View 3 replies

fennecfoxy 33 minutes ago

You're right, and the person who already commented is being facetious. A better title would be "Meta's Llama 3.1 can recall the next sentence in the First Harry Potter book with 42% accuracy". The title intentionally makes it seem as though the model can predict the first 42% of the entire text of the first Harry Potter book when queried with something like "Read me Harry Potter and the Philosopher's stone".

Reply View | 0 replies
7bit 3 hours ago

What would be a better title? You're correct that the title isn't accurate, however, click bait? I wouldn't say so. But I'm lacking imagination to find a better one. Interested to hear your suggestion.

Reply View | 0 replies

dankwizard 8 hours ago

I can recall about 12% of the first Harry Potter book so it's interesting to see Llama is only 4x smarter than me. I will catch up.

Reply View 2 replies

hsbauauvhabzb 8 hours ago

How many r’s are there in strawberry?

Reply View | 1 reply
- jofzar 8 hours ago
  
  There are 3 R's in strawberry just like in Harry Potter!
  
  Reply View | 0 replies

graphememes 8 hours ago

I really wish we could get rid of copyright. It's going to hold us back long term.

Reply View 10 replies

bitmasher9 8 hours ago

We cannot get rid of it without finding a way to pay the creators that generate copyrighted works.
I’m personally more in favor of significantly reducing the length of the copyright. I think 20-30 years is an interesting range. Artists get roughly a career length of time to profit off their creations, but there is much less incentive for major corporations to buy and hoard IP.

Reply View | 7 replies
- atrus 8 hours ago
  
  We barely pay creators as it is for generating copyrighted works. Nearly every copyrighted work is available on the internet, for free, right now. And creators are still getting paid, albeit poorly, but that's a constant throughout history.
  
  Reply View | 3 replies
  
  jeroenhd 4 hours ago
  
  The thing about creators is that most of them are paid extremely poorly, and some of them get insanely rich. Joanne Rowling has received more money than a reasonable person could use for her wizard books, but millions of bloggers feeding much more data into AI training sets will never see a cent for their work. For starting authors selling books, this can easily be the difference between writing another book or giving up and taking up another job.
  At the moment, there's also a huge difference between who does and who doesn't pay. If I put the HP collection on my website, you betcha Joanne Rowling's team is going to try to take it down. However, because OpenAI designed an AI system where content cannot be removed from its knowledge base and because their pockets are lined with cash for lawyers, it's practically free to violate whatever copyright rules it wants.
  
  Reply View | 1 reply
  
  AStonesThrow 4 hours ago
  
  [dead]
  
  Reply View | 0 replies
  
  Tepix 6 hours ago
  
  How does that favor a longer copyright? It’s not like these old works make a lot of money (with very few exceptions). And making money after 30 years is hardly a motivating factor.
  
  Reply View | 0 replies
- [removed] 7 hours ago
  
  [deleted]
  
  Reply View | 0 replies
- jMyles 6 hours ago
  
  I do not think it's creators that are the constituency holding up deprecation.
  As a full-time professional musician, I'm convinced I'll benefit much more from its deprecation than continuing to flog it into posterity. I don't think I know any musicians who believe that IP is career-relevant for them at this point.
  (Granted, I play bluegrass, which has never fit into the copyright model of music in the first place)
  
  Reply View | 0 replies
- [removed] 6 hours ago
  
  [deleted]
  
  Reply View | 0 replies
JoshTriplett 8 hours ago

I do too. But in the meantime, as long as it continues being used against anyone, it should be applied fairly. As long as anyone has to respect software licenses, for instance, then AIs should too. It doesn't stop being a problem just because it's done at larger scale.

Reply View | 0 replies
numpad0 7 hours ago

Sure, you just get constantly sued for obstruction of business instead, and there will be no fair use clauses, free software licenses, or right to repair to fight back. It'll be all proprietary under NDA. Is that what you want?

Reply View | 0 replies

cowbolt 3 hours ago

Imagine the literary possibilities when it can write 100%! Rowling's original work was an amusing, if rather derivative children's book. But Llama's version of the Philosophers stone will be something else entirely. Just think of the rather heavy-handed Cerberus reference in the original work. Instead of a rote reference to Greek mythology used as a simple trope, it will be filled with a subtext that only an LLM can produce.

Right now they're working on recreating the famous sequence with the troll in the dungeon. It might cost them another few billion in training, but the end results will speak for themselves.

Reply View 0 replies

fennecfoxy an hour ago

I mean it makes sense. Same thing as George RR Martin complaining that it can spit out chunks of his books (finish your books already!!)

As I have pointed out many times before - for GRRM's books and for HP books, the Internet is FILLED to the brim with quotes from these books, there are uploads of the entire books, there are several (not just one) fan wikis for each of these fandoms. There is a lot of content in general on the Internet that quotes these books, they are pop culture sensations.

So of course they're weighted heavily when training an LLM by just feeding it the Internet. If a model could ever recount it correctly 100% in the correct order, then that's overfitting. But otherwise it's just plain & simple high occurrence in training data.

Reply View 0 replies

flowerthoughts 5 hours ago

If LLMs are good at summarizing/compressing, what does this say about the underlying text? Why are some passages more easily recalled? Sure, some sections have probably been quoted more times than others, so there's bias in training data, which might explain why the Llama 1 and 3.1 images have similar peaks. Would this happen to LLMs even with no training bias?

Edit: seems the first part is a memory of being bullied by Dudley. The second is where he's been selected for the Quidditch team. Possibly they are just boring passages, compared to the surrounding ones. So probably just training bias.

Reply View 0 replies

Javantea_ 7 hours ago

I'm surprised no one in the comments has mentioned overfitting. Perhaps this is too obvious but I think of it as a very clear bug in a model if it asserts something to be true because it has heard it once. I realize that training a model is not easy, but this is something that should've been caught before it was released. Either QA is sleeping on the job or they have intentionally released a model with serious flaws in its design/training. I also understand the intense pressure to release early and often, but this type of thing isn't a warning.

Reply View 4 replies

jeroenhd 4 hours ago

Overfitting makes for more human-like output (because it's repeating words written by a human). Out of all possible failure states of a model, overfitting is probably what you want out of an LLM, as long as it's not overfitted enough to lose lawsuits.

Reply View | 1 reply
- fennecfoxy 31 minutes ago
  
  I disagree. I'd include overfitting for LLMs as creating unreasonably strong connections to individual sequences used for training, whereas a good mix of that and connections between chunks of those sequences are required.
  
  Reply View | 0 replies
numpad0 6 hours ago

It's apparently known among LLM researchers that the best epoch count for LLM training is one. They go through the entire dataset once, and that makes the best LLMs.
They know. An LLM is a novel compression format for text (holographic memory or whatever). The question is whether the rest of the world accepts this technology as it is or not.

Reply View | 0 replies
Tepix 6 hours ago

I think part of the problem is that the book is in the training set multiple times

Reply View | 0 replies

Machado117 3 hours ago

Do LLMs have any perception that Harry Potter is fiction or is it possible that they will give some magical advice based on fiction works that they have been trained with?

edit: never mind, I’ll just ask ChatGPT

Reply View 1 reply

otabdeveloper4 3 hours ago

LLMs don't have "perception" at all, they only ever output a likely text completion token.

Reply View | 0 replies

whitehexagon 3 hours ago

I wonder what percentage we could expect from a true general AI, 100% ?

It would be nice to know that at least our literature might survive the technological singularity.

Reply View 0 replies

bradley13 7 hours ago

Many people could also produce text snippets from memory. I dispute that reading a book is a copyright violation. Copying and distributing a book, yes, but just reading it - no.

If the book was obtained legitimately, letting an LLM read it is not an issue.

Reply View 1 reply

riffraff 6 hours ago

It is well reported that Meta (and OpenAI and basically everyone) trained on content obtained via piracy (LibGen).

Reply View | 0 replies

BUFU 6 hours ago

Would it be possible that other people posted the content of the Harry Potter book online and the model developer scraped that information? Would the model developer be at fault in this scenario?

Reply View 1 reply

timeon 6 hours ago

I think this is a good question, at least for LLMs in general. However, we know that Meta used pirated torrents.

Reply View | 0 replies

htk 8 hours ago

Hmm, couldn't this be used as a benchmark for quantization algorithms?

Reply View 0 replies

choeger 6 hours ago

LLMs are to a certain degree compressed databases of their training data. But 42% is a surprisingly large number.

Reply View 0 replies

tikhonj 3 hours ago

Meta Llama, Author of Harry Potter

Reply View 0 replies

WhatsName 20 hours ago

Given the method and how the English language works, isn't that the expected outcome for any text that isn't highly technical?

Guess the next word: Not all heroes wear _____

Reply View 1 reply

aspenmayer 18 hours ago

As there is no reason to believe that Harry Potter is axiomatic to our culture in the way that other concepts are, it is strange to me that the LLMs are able to respond in this way, and not at all expected. Why do you think this outcome is expected? Are the LLMs somehow encoding the same content in such a way that they can be prompted to decode it? Does it matter legally how LLMs are doing what they do technically? This is pertinent to the court case that Meta is currently party to.
https://en.wikipedia.org/wiki/Artificial_intelligence_and_co...
> See for example OpenAI's comment in the year of GPT-2's release: OpenAI (2019). Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation (PDF) (Report). United States Patent and Trademark Office. p. 9. PTO–C–2019–0038. “Well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus”
https://copyrightalliance.org/kadrey-v-meta-hearing/
> During the hearing, Judge Chhabria said that he would not take into account AI licensing markets when considering market harm under the fourth factor, indicating that AI licensing is too “circular.” What he meant is that if AI training qualifies as fair use, then there is no need to license and therefore no harmful market effect.
I know this is arguing against the point that this copyright lobbyist is making, but I hope so much that this is the case. The “if you sample, you must license” precedent was bad, and it was an unfair taking from the commons by copyright holders, imo.
The paper this post is referencing is freely available:
https://arxiv.org/abs/2505.12546

Reply View | 0 replies

[removed] 3 hours ago

[deleted]

Reply View 0 replies

evertedsphere 9 hours ago

what is that bar (= token span) on the right common to the first three models

Reply View 0 replies

deafpolygon a day ago

It will generate a correct next token 42% of the time when prompted with a 50 token quote.

Not 42% of the book.

It's a pretty big distinction.

Reply View 7 replies

j16sdiz 8 hours ago

next _50_ tokens 42% of the time
not just next token.
This is like: tell it a random sentence in the book, it will give you the next sentence 42% of time.
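For anyone trying to square this with the headline, here is a rough sketch of the measurement as I read the description elsewhere in this thread: slide a window over the book, give the model 50 tokens, compute the probability it assigns to the exact next 50 tokens, and count the share of windows where that probability is at least one half. The `score` callable is a hypothetical stand-in for a real model's per-token log-probabilities, the stride is in tokens for simplicity (the thread says the study steps by 10 characters), and `toy_score` is invented purely so the example runs.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical scorer: given prefix tokens and continuation tokens, return the
# log-probability the model assigns to each continuation token in turn.
Scorer = Callable[[Sequence[int], Sequence[int]], List[float]]


def memorized_fraction(tokens: Sequence[int], score: Scorer,
                       prefix_len: int = 50, cont_len: int = 50,
                       stride: int = 10, threshold: float = 0.5) -> float:
    """Share of sliding windows whose exact continuation has probability >= threshold."""
    window = prefix_len + cont_len
    hits = total = 0
    for start in range(0, len(tokens) - window + 1, stride):
        prefix = tokens[start:start + prefix_len]
        cont = tokens[start + prefix_len:start + window]
        # Probability of the whole excerpt is the product of per-token probabilities.
        excerpt_p = math.exp(sum(score(prefix, cont)))
        hits += excerpt_p >= threshold
        total += 1
    return hits / total if total else 0.0


def toy_score(prefix: Sequence[int], cont: Sequence[int]) -> List[float]:
    """Toy stand-in for a model: 'knows' continuations that start below token id 150."""
    per_token_p = 0.99 if cont[0] < 150 else 0.5
    return [math.log(per_token_p)] * len(cont)


if __name__ == "__main__":
    fake_book = list(range(300))  # token ids standing in for a tokenized book
    print(f"memorized fraction: {memorized_fraction(fake_book, toy_score):.3f}")
```

On this reading the 42% is a share of excerpts that clear a 50% bar, not a 42% chance per excerpt.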

Reply View | 0 replies
deviation a day ago

A... massive distinction.

Reply View | 0 replies
[removed] 8 hours ago

[deleted]

Reply View | 0 replies
asplake a day ago

“… well enough to reproduce 50-token excerpts at least half the time”

Reply View | 0 replies
chiph2o 17 hours ago

This means that if we start with 50% of the book then there is 42% chance that we can recreate the remaining 50%.
What is the distinction between understanding and memorization? What is the chance that understanding results in memorization (maybe in the case of humans)?

Reply View | 2 replies
- [removed] 8 hours ago
  
  [deleted]
  
  Reply View | 0 replies
- ipaddr 6 hours ago
  
  It stores how often characters will come next based on how often they occur in copyrighted material. It can reproduce parts because those values are a fingerprint.
  It should break copyright laws as written now, but there is too much money involved.
  
  Reply View | 0 replies

gamblor956 5 hours ago

It's not fair use just because you guys want it to be fair use.

While limited quoting can be (and usually is) considered fair use, quoting significant portions of a book (much less 42% of it) has never been fair use, in the U.S., Europe, or any other nation.

Yes, information wants to be free, yada yada. That means facts. Whether creative works are free is up to their creators.

Reply View 0 replies

curiousgal 5 hours ago

[flagged]

Reply View 1 reply

tomhow 4 hours ago

Please don't do this here. If you're going to use the word, use the word, but also, please don't use words like that about people here, no matter what you think of them. A comment like this breaks multiple guidelines:
https://news.ycombinator.com/newsguidelines.html
We detached this comment from https://news.ycombinator.com/item?id=44287156 and marked it off topic.

Reply View | 0 replies

aspenmayer a day ago

https://archive.is/OSQt6

If you've seen as many magnet links as I have, with your subconscious similarly primed with the foreknowledge of Meta having used torrents to download/leech (and possibly upload/seed) the dataset(s) to train their LLMs, you might scroll down to see the first picture in this article from the source paper, and find uncanny the resemblance of the chart depicted to a common visual representation of torrent block download status.

Can't unsee it. For comparison (note the circled part):

https://superuser.com/questions/366212/what-do-all-these-dow...

Previously, related:

Extracting memorized pieces of books from open-weight language models - https://news.ycombinator.com/item?id=44108926 - May 2025

Reply View 1 reply

[removed] a day ago

[deleted]

Reply View | 0 replies

bjornsing 7 hours ago

It’s well-known that John von Neumann had this ability too:

Herman Goldstine wrote "One of his remarkable abilities was his power of absolute recall. As far as I could tell, von Neumann was able on once reading a book or article to quote it back verbatim; moreover, he could do it years later without hesitation. He could also translate it at no diminution in speed from its original language into English. On one occasion I tested his ability by asking him to tell me how A Tale of Two Cities started. Whereupon, without any pause, he immediately began to recite the first chapter and continued until asked to stop after about ten or fifteen minutes."

Maybe it’s just an unavoidable side effect of extreme intelligence?

Reply View 0 replies

giardini 9 hours ago

As I've said several times, the corpus is key: LLMs thus far "read" most anything, but should instead have well-curated corpora. "Garbage In, Garbage Out!" (GIGO) is the saying.

While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder.

Reply View 31 replies

esafak 9 hours ago

That's got nothing to do with it. It's all about copyright. Can it reproduce its training data verbatim? If so, Meta is in hot water.

Reply View | 7 replies
- strangescript 8 hours ago
  
  I read harry potter, and you ask me about a page, and I can recite it verbatim, did I just commit copyright infringement?
  
  Reply View | 6 replies
  
  lucianbr 8 hours ago
  
  Are you selling your ability to recite stuff? Then certainly.
  
  Reply View | 2 replies
  
  bitmasher9 8 hours ago
  
  I pay for a service. The service recites a novel to me. The service would need permission to do this or it is copyright infringement.
  
  Reply View | 0 replies
  
  [removed] 8 hours ago
  
  [deleted]
  
  Reply View | 0 replies
  
  __loam 8 hours ago
  
  This is an extremely common strawman argument. We're not discussing human memory.
  
  Reply View | 0 replies
Jap2-0 8 hours ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere
To address this point, and not other concerns: the benefits would be (1) pop culture knowledge and (2) having a variety of styles of edited/reasonably good-quality prose.

Reply View | 0 replies
alephnerd 9 hours ago

> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere
It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?
> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder
Plenty of in-stealth companies are tackling LLMs this way ;)
For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend where certain PIs would simply translate German and Russian papers from the early-to-mid 20th century and attribute them to themselves in fields like CS (especially in what became ML).

Reply View | 21 replies
- epgui 9 hours ago
  
  > It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?
  Personally I’m assuming the worst.
  That being said, Harry Potter was such a big cultural phenomenon that I wonder to what degree one might actually be able to reconstruct the books based solely on publicly accessible derivative material.
  
  Reply View | 0 replies
- weird-eye-issue 9 hours ago
  
  Why are you talking about Claude and Anthropic?
  
  Reply View | 2 replies
  
  cshimmin 8 hours ago
  
  It’s not unreasonable to suspect they are doing the same. The article starts with a description of a lawsuit the NY Times brought against OpenAI for similar reasons. The big difference is that the research presented here is only possible with open-weight models. OAI and Anthropic don’t make the base models available, so it’s easier to hide the use of copyrighted material behind instruction post-training. And I’m not sure you can get the logprobs for specific tokens from their APIs either (which is what the researchers did to make the figures and come up with a concrete number like 42%).
  
  Reply View | 0 replies
  
  alephnerd 7 hours ago
  
  Good call! I brain farted and wrote Claude/Anthropic instead of Meta/Llama.
  
  Reply View | 0 replies
- ninetyninenine 9 hours ago
  
  So if I memorized Harry Potter, the physical encoding which definitely exists in my brain is a copyright violation?
  
  Reply View | 16 replies
  
  dvt 8 hours ago
  
  > the physical encoding which definitely exists in my brain is a copyright violation
  First of all, we don't really know how the brain works. I get that you're being a snarky physicalist, but there are plenty of substance dualists, panpsychists, etc. out there. So, some might say, this is a reductive description of what happens in our brains.
  Second of all, yes, if you tried to publish Harry Potter (even if it was from memory), you would get in trouble for copyright violation.
  
  Reply View | 8 replies
  
  lithiumii 9 hours ago
  
  You are not selling or distributing copies of your brain.
  
  Reply View | 0 replies
  
  harry8 9 hours ago
  
  If you perform it from memory in public without paying royalties then yes, yes it is.
  Should it be? Different question.
  
  Reply View | 0 replies
  
  JKCalhoun 8 hours ago
  
  The end of "Fahrenheit 451" set a horrible precedent. Damn you, Bradbury!
  
  Reply View | 0 replies
  
  beowulfey 8 hours ago
  
  Only if you charge someone to reproduce it for them
  
  Reply View | 0 replies
  
  shrewduser 9 hours ago
  
  Maybe if you rewrote it from memory.
  
  Reply View | 0 replies
  
  teaearlgraycold 9 hours ago
  
  I think humans get a special exception in cases like this
  
  Reply View | 1 reply
  
  otabdeveloper4 3 hours ago
  
  No they don't. Commercial intent is what is prosecuted in IP law.
  
  Reply View | 0 replies