Comment by zmmmmm 17 hours ago

It's important to note the way it was measured:

> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time

As I understand it, it means if you prompt it with some actual context from a specific subset that is 42% of the book, it completes it with 50 tokens from the book, 50% of the time.

So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those together or use some other method to build actual substantial portions of the book. And if it only gets it right 50% of the time, that seems like it would be very hard to do with high fidelity.

Having said all that, what is really interesting is how different the latest Llama 70b is from previous versions. It does suggest that Meta maybe got a bit desperate and started over-training on certain materials that greatly increased its direct recall behaviour.

Aurornis 17 hours ago

> So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own.

That’s what I was thinking as I read the methodology.

If they dropped the same prompt fragment into Google (or any search engine) how often would they get the next 50 tokens worth of text returned in the search results summaries?

jxjnskkzxxhx 6 hours ago

Suppose for simplicity that every sentence in the book is 50 tokens or shorter.

According to the stated methodology, I could give the LLM sentence 1 and have 42% chance of getting sentence 2 recalled. Then I could give it sentence 2 and have 42% chance of getting sentence 3. Therefore, the LLM contains 42% of the book in some sense.

I disagree this is "not really very much". If a person could do this you would undoubtedly conclude that the person read the book.

In fact the number 42% even understates the severity of the matter. Superficially it makes it sound as if the LLM contains less than half of the book. In reality the process I described applies to 100% of the sentences. Additionally, I'm guessing that in the 58% of cases where the 50 tokens aren't recalled correctly, the output tokens probably have much the same meaning as the correct ones.
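
A back-of-envelope sketch of the chaining idea (assuming, unrealistically, that each recall is an independent event; real recall rates are surely correlated across a book):

```python
# Sketch of chaining 50-token recalls. Assumption (mine, not the paper's):
# each chunk is recalled independently with probability p, so a verbatim
# chain of k chunks succeeds with probability p**k.

def chain_prob(p: float, k: int) -> float:
    """Probability of k successful 50-token recalls in a row."""
    return p ** k

# Verbatim chains die off geometrically even at a high per-chunk rate:
for k in (1, 5, 10):
    print(f"{k:2d} chunks: {chain_prob(0.42, k):.2e}")
```

Under this (over-simplified) independence assumption, ten consecutive chunks come out verbatim only about 0.02% of the time, which is the fidelity question raised upthread.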

  • TeMPOraL 4 hours ago

    Except that's not what happened, per the article. Instead, they walked down the logits, which is more like asking someone to give their 10-20 best guesses for the next word and, should one of them match the secret answer, telling them which one it is and asking them to go on with the next word. That seems like a substantially easier task, and most of the information is coming from the researchers making a choice at every step.

vintermann 16 hours ago

All this study really says is that models are really good at compressing the text of Harry Potter. You can't get Harry Potter out of it without prompting it with the missing bits - sure, impressively few bits, but is that surprising, considering how many references and fair use excerpts (like discussion of the story in public forums) it's seen?

There's also the question of how many bits of originality there actually are in Harry Potter. If trained strictly on text up to the publishing of the first book, how well would it compress it?

  • fiddlerwoaroof 15 hours ago

    The alternate here is that Harry Potter is written with sentences that match the typical patterns of English and so, when you prompt with a part of the text, the LLM can complete it with above-random accuracy

    • vintermann 15 hours ago

      Anything that can tell you what the typical patterns of English are is going to be a language model, by definition.

      • fiddlerwoaroof 15 hours ago

        My point is that this might just prove that Harry Potter is the sort of prose “fancy autocomplete” would produce and not all that original.

        EDIT Actually, on rereading, I see I replied to the wrong comment.

    • fiddlerwoaroof 15 hours ago

      Or else, LLMs show that copyright and IP are ridiculous concepts that should be abolished

bee_rider 17 hours ago

Even if it is recalling it 50 tokens at a time, the half of the book is in some sense in there, right?

  • everforward 14 hours ago

    I don’t think this paper proves that, and I don’t think it is in a traditional sense.

    It can produce the next sentence or two, but I suspect it can’t reproduce anything like the whole text. If you were to recursively ask for the next 50 tokens, the first time it’s wrong the output would probably cease matching because you fed it not-Harry-Potter.

    It seems like chopping Harry Potter up into two sentences at a time on Post-its and tossing those in the air. It does contain Harry Potter, in a way, but without the structure is it actually Harry Potter?

  • kelipso 5 hours ago

    Almost the entire book is in there. From the paper, if you give it a 100 token prompt, it will produce the next 50 tokens with more than 1% probability so that the produced tokens cover 91% of the book. And as the title says, it also produces next 50 tokens with more than 50% probability, so produced tokens cover 42% of the book. Bet it gets close to 100% as you reduce the probability.

    Also they went through the book in 10-token strides. Like... a bit of a tortured way to reproduce the book (basically impossible to actually reproduce the book this way), but it shows that the content is in there.

    Now whether this is derivative work, copyright violation or whatever is debatable. Probably gets similar numbers for a bunch of other books too. They should have done the Bible and probably get way higher numbers, but that won’t go viral.
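
The stride-and-threshold measurement described above can be sketched like this (illustrative names only; `prob_of_continuation` stands in for a real model call, and details may differ from the paper):

```python
# Rough sketch of the sliding-window memorization measurement described
# above. All names are illustrative; `prob_of_continuation` would query a
# real model for the probability of the true continuation given the prompt.

def memorized_fraction(tokens, prob_of_continuation, threshold=0.5,
                       prompt_len=100, cont_len=50, stride=10):
    """Slide over the book in `stride`-token steps; count positions where
    the model assigns > `threshold` probability to the true 50-token
    continuation of the preceding 100-token prompt."""
    hits = total = 0
    for start in range(0, len(tokens) - prompt_len - cont_len, stride):
        prompt = tokens[start:start + prompt_len]
        cont = tokens[start + prompt_len:start + prompt_len + cont_len]
        total += 1
        if prob_of_continuation(prompt, cont) > threshold:
            hits += 1
    return hits / total if total else 0.0
```

Running the same function with `threshold=0.5` and `threshold=0.01` would give the 42% and 91% figures respectively.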

    • bee_rider 3 hours ago

      I think I agree with this take. The book is in there in some sense, whether or not it is a copyright violation is debatable.

      Honestly, I get why these debates happen—it is practical to establish whether or not this emerging tech is illegal under current law. But it’s also like… well, obviously current law wasn’t written with this sort of application in mind.

      Whether or not we think LLMs are basically good or bad, they are clearly quite impactful. It would be a nice time to have a functional legislature to address this directly.

  • TeMPOraL 8 hours ago

    Not necessarily. Information is always spread between what we'd normally consider the "storage medium" and the "reader"; the degree to which that happens is a controllable parameter.

    Consider e.g.:

    - The digital expansion of pi to sufficiently many decimal places contains both parts of the work and the full work. The trick is you have to know where to find it - and it's that knowledge that's actually equivalent to the work itself.

    - Any kind of compression that uses a dictionary separate from the compressed artifact shifts some of the information into a dictionary file or, if it's a common dictionary, into the compressor/decompressor itself.

    In the case from the study, the experimenter actually has to supply most of the information required to pull Harry Potter out of the model - they need to make specific prompts with quotes from the book, and then observe which logits correspond to the actual continuation of those quotes. The experimenter is doing information-loaded selection multiple times: at prompting, and at identifying logits. This by itself doesn't really prove the model memorized the book, only that it saw fragments from it - in cases where those fragments are book-specific (e.g. using proper names from the HP world) rather than generic English sentences.
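
A toy sketch of that selection procedure (hypothetical interface; `top_k_candidates` stands in for querying the model's top-k next-token guesses):

```python
# Toy sketch of the experimenter-guided procedure described above. The
# experimenter supplies information at every step by picking the matching
# candidate; `top_k_candidates` is a hypothetical stand-in for a model query.

def oracle_guided_match(true_tokens, top_k_candidates, k=20):
    """Count how many steps the true next token stays within the model's
    top-k guesses when an oracle always injects the correct token."""
    prefix = []
    for t in true_tokens:
        if t not in top_k_candidates(prefix, k):
            break
        prefix.append(t)  # the oracle picks the right guess and continues
    return len(prefix)
```

A long match under this procedure is a much weaker claim than the model greedily generating the same span unassisted.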

  • zmmmmm 16 hours ago

    yeah ... it's going to depend on how the issue is framed. However, a "copy" of something from which there is no practical way to extract the original probably has a pretty good argument that it's not really a "copy". For example, a regular dictionary probably has 99% of Harry Potter in it. Is it a copy?

  • vintermann 16 hours ago

    I'd say no. More than half of as-yet unwritten books will be in there too, because I bet it will compress the text of a freshly published book much better than 50% (and newer models could even compress new books to one fiftieth of their size, which is more like what that 1-in-50 tokens figure suggests).

    • bee_rider 15 hours ago

      That seems like a reasonably easy test to run, right? All you need is a bit of prose that was known not to have been written beforehand. Actually, the experiment could be run using the paper itself!
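
That test amounts to measuring how many bits per token the model needs for genuinely fresh text; a minimal sketch, assuming the per-token probabilities come from some real model:

```python
import math

# Minimal sketch of a compression-rate measurement for the proposed test.
# Assumption: `token_probs` holds the probabilities a model assigns to each
# true token of a text it cannot have seen in training.

def bits_per_token(token_probs):
    """Average negative log2 likelihood: the ideal compressed size in bits
    per token achievable with arithmetic coding under this model."""
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)

# A model that always assigns 50% to the true token needs 1 bit per token:
print(bits_per_token([0.5, 0.5, 0.5]))
```

Comparing this number on pre-publication-cutoff text versus Harry Potter would separate "good at English" from "memorized this book".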

om8 7 hours ago

> 50 tokens is not really very much

Yes! And also Llama 3.1's tokens are different from Qwen and Llama 1 tokens. That's the first model where Meta started to use a very large vocab_size.

adrianN 17 hours ago

Fair use is not a thing in every jurisdiction. In Germany, for example, there are cases where three words („wir sind Papst") fall under copyright.

  • yorwba 15 hours ago

    Germany does not have something called "fair use," but it does have provisions for uses that are fair. For example your use of the three words to talk about their copyrighted status is perfectly legal in Germany. That somebody wasn't allowed to use them in a specific way in the past doesn't mean that nobody is allowed to use them in any way.

    • adrianN 14 hours ago

      Of course, but „it’s a short quote so you can use it“ is not true (at least in Germany).

      • yorwba 13 hours ago

        To be pedantic, short quotes (as opposed to short copied fragments that are not used as quotes) are explicitly one of the allowed uses (Zitierbefugnis). You can even quote entire works "in an independent scientific work for the purpose of explaining its content"! https://www.gesetze-im-internet.de/englisch_urhg/englisch_ur...

        Generally speaking, exceptions to copyright are based on the appropriateness of the amount of copied content for the given allowed use, so the shorter it is, the more likely it is for copying to be permitted. European copyright law isn't much different from fair use in that respect.

        Where it does differ is that the allowed uses are more explicitly enumerated. So Meta would have to argue e.g. based on the exception for scientific works specifically, rather than more general principles.

seydor 12 hours ago

The claim of the paper is not so much that the model is reproducing content illegally, but that Harry Potter was used to train the model.

This does not appear to happen to the same degree with the other models they tested.

amanaplanacanal 17 hours ago

Fair use is a four-part test, and the amount of copying is only one of the four parts.

xnx 17 hours ago

This sounds almost like "Works every time (50% of the time)."

  • hsbauauvhabzb 17 hours ago

    Except the odds of it happening even 50% of the time by chance are lower than winning the lottery multiple times. All while illegally ingesting copyrighted material without the consent of (and presumably against the wishes of) the copyright holder.

thomastjeffery 5 hours ago

An LLM is not a database. There is no significant amount of information in a model that can be accessed 100% of the time. This is because it's a mystery to the user what collection of tokens will lead to a specific output. To get a predictable result from an LLM 50% of the time is very significant.

This doesn't tell us for certain whether or not the model was trained on a full copy of the book. It's possible that 50-token long passages from 42% of the book were, incidentally, quoted verbatim in various parts of the training data. Considering the popularity of both the book itself, and derivative fan-fiction, I would not be surprised. I would be less surprised to learn that it was indeed trained on a full copy of the book, if not several.

The more meaningful point here is that the ability to reproduce half a book is the same sort of overt derivative work that is definitely considered copyright infringement in other circumstances. A lossy copy is still a copy. If we are to hold LLMs to the same standard as other content, this isn't very easy to defend.

Personally, I see this as a good opportunity to reevaluate copyright on the whole. I think we would be better off without it.

raincole 17 hours ago

(Disclaimer: haven't read the original paper)

It sounds like a ridiculous way to measure it. Producing 50-token excerpts absolutely doesn't translate to "recalling X percent of Harry Potter" for me.

(Edit: I read this article. Nothing burger if its interpretation of the original paper is correct.)

  • tanaros 16 hours ago

    Their methodology seems reasonable to me.

    To clarify, they look at the probability a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this for all sequences in the book using a sliding window of 10 characters (NB: not tokens). Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.

    Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.

    • raincole 12 hours ago

      > one of those tricky semantic arguments we have yet to settle when it comes to LLMs

      Sure. But imagine this: In a hypothetical world where LLMs never ever exist, I tell you that I can recall 42 percent of the first Harry Potter book. What would you assume I can do?

      It's definitely not "this guy can predict next 10 characters with 50% accuracy."

      Of course the semantics of 'recall' aren't the point of this article. The point is that Harry Potter was in the training set. But I still think it's a nothing burger. It would be very weird to assume Llama was trained on copyright-free materials only. And afaik there isn't a legal precedent saying training on copyrighted materials is illegal.