everforward 14 hours ago

I don’t think this paper proves that, and I don’t think it is in a traditional sense.

It can produce the next sentence or two, but I suspect it can’t reproduce anything like the whole text. If you were to recursively ask for the next 50 tokens, the first time it’s wrong the output would probably cease matching because you fed it not-Harry-Potter.
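A back-of-the-envelope sketch of that intuition, under loudly simplified assumptions: every 50-token chunk is reproduced exactly with the same independent probability, and the book is a hypothetical 100,000 tokens long. Neither number comes from the paper, and real chunk probabilities are neither equal nor independent.

```python
import math

def log10_chain_probability(p_chunk: float, n_chunks: int) -> float:
    """log10 of the chance that n_chunks consecutive chunks all match,
    assuming (unrealistically) independent chunks with equal probability."""
    return n_chunks * math.log10(p_chunk)

# Hypothetical book length; not a figure from the paper.
BOOK_TOKENS = 100_000
CHUNK_TOKENS = 50
n_chunks = BOOK_TOKENS // CHUNK_TOKENS  # 2000 chunks

# Even at 50% per chunk, an unbroken chain has probability around 10**-602.
print(log10_chain_probability(0.5, n_chunks))
```

So even a per-chunk success rate of 50% would make an end-to-end unprompted reproduction vanishingly unlikely under these assumptions.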

It seems like chopping Harry Potter up into two sentences at a time on Post-it notes and tossing those in the air. It does contain Harry Potter, in a way, but without the structure, is it actually Harry Potter?

kelipso 5 hours ago

Almost the entire book is in there. From the paper: given a 100-token prompt, the model produces the next 50 tokens with more than 1% probability for passages that together cover 91% of the book. And as the title says, it produces the next 50 tokens with more than 50% probability for passages covering 42% of the book. I bet coverage gets close to 100% as you lower the probability threshold.

Also, they went through the book at 10-token strides. It's a bit of a tortured way to reproduce the book (basically impossible to actually reproduce the book this way), but it shows that the content is in there.
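A rough sketch of how a coverage number like that could be computed. This is not the paper's code: the function name, the rule "a token counts as covered if it falls inside any continuation whose probability clears the threshold", and the `continuation_prob` callback (a stand-in for querying the model) are all assumptions made for illustration.

```python
from typing import Callable, Sequence

def coverage(
    tokens: Sequence[int],
    continuation_prob: Callable[[Sequence[int], Sequence[int]], float],
    threshold: float,
    prompt_len: int = 100,
    cont_len: int = 50,
    stride: int = 10,
) -> float:
    """Fraction of tokens that lie in at least one continuation window
    whose probability (as reported by continuation_prob) exceeds threshold."""
    if not tokens:
        return 0.0
    covered = [False] * len(tokens)
    last_start = len(tokens) - (prompt_len + cont_len)
    # Slide a (prompt, continuation) window over the text, `stride` tokens at a time.
    for start in range(0, last_start + 1, stride):
        split = start + prompt_len
        prompt = tokens[start:split]
        cont = tokens[split:split + cont_len]
        if continuation_prob(prompt, cont) > threshold:
            for i in range(split, split + cont_len):
                covered[i] = True
    return sum(covered) / len(tokens)

# Toy usage with a fake "model" that knows every continuation perfectly.
toy_tokens = list(range(300))
print(coverage(toy_tokens, lambda prompt, cont: 1.0, threshold=0.5))
```

Note that in this sketch the first 100 tokens can never be covered, since they only ever appear as prompt material.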

Now, whether this is a derivative work, a copyright violation, or whatever is debatable. It probably gets similar numbers for a bunch of other books too. They should have done the Bible; they would probably get way higher numbers, but that won’t go viral.

  • bee_rider 3 hours ago

    I think I agree with this take. The book is in there in some sense; whether or not it is a copyright violation is debatable.

    Honestly, I get why these debates happen—it is practical to establish whether or not this emerging tech is illegal under current law. But it’s also like… well, obviously current law wasn’t written with this sort of application in mind.

    Whether or not we think LLMs are basically good or bad, they are clearly quite impactful. It would be a nice time to have a functional legislature to address this directly.

TeMPOraL 8 hours ago

Not necessarily. Information is always spread between what we'd normally consider the "storage medium" and the "reader"; how much of it sits on each side is a controllable parameter.

Consider e.g.:

- The decimal expansion of pi, taken to enough places, contains both parts of the work and the work in full (assuming pi is normal, which is widely believed but unproven). The trick is you have to know where to find it, and it's that knowledge that's actually equivalent to the work itself.

- Any kind of compression that uses a dictionary separate from the compressed artifact shifts some of the information into a dictionary file or, if it's a common dictionary, into the compressor/decompressor itself.
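The dictionary point can be seen with a small sketch using the preset-dictionary support in Python's `zlib`. The sample sentence is just an illustrative public-domain line, and using the message itself as the dictionary is a deliberately extreme assumption to make the effect obvious.

```python
import zlib

def compress(data: bytes, zdict: bytes = b"") -> bytes:
    """Deflate `data`, optionally priming the compressor with a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(data) + c.flush()

def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inflate `blob`; the same preset dictionary must be supplied if one was used."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()

message = b"It was the best of times, it was the worst of times."
shared_dict = message  # extreme case: the "reader" already holds the whole text

plain = compress(message)
primed = compress(message, shared_dict)

# The primed artifact is much smaller, but only because the information
# moved into the dictionary that the decompressor must also have.
print(len(message), len(plain), len(primed))
assert decompress(primed, shared_dict) == message
```

The smaller blob is only decodable by a reader that already holds the dictionary, which is the sense in which information has moved from the artifact to the reader.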

In the case from the study, the experimenter actually has to supply most of the information required to pull Harry Potter out of the model: they need to make specific prompts with quotes from the book, and then observe which logits correspond to the actual continuation of those quotes. The experimenter is doing information-loaded selection multiple times: at prompting, and at identifying logits. This by itself doesn't really prove the model memorized the book, only that it saw fragments of it, at least in cases where those fragments are book-specific (e.g. using proper names from the HP world) rather than generic English sentences.

zmmmmm 16 hours ago

yeah ... it's going to depend on how the issue is framed. However, a "copy" of something where there is no way to practically extract the original from it probably has a pretty good argument that it's not really a "copy". For example, a regular dictionary probably has 99% of Harry Potter in it. Is it a copy?

vintermann 16 hours ago

I'd say no. More than half of as-yet unwritten books will be in there too, because I bet it will compress the text of a freshly published book much better than 50% (and newer models could even compress new books to one fiftieth of their size, which is more like what that 1-in-50-tokens figure suggests).
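For reference, the usual link between probability and compression is that a model assigning probability p to a sequence can, with an ideal entropy coder, encode it in about -log2(p) bits. A minimal sketch of that arithmetic follows; the 16-bits-per-token baseline is an assumed round number for illustration, not a property of any particular tokenizer.

```python
import math

def ideal_code_bits(prob: float) -> float:
    """Ideal code length, in bits, for an event the model assigns probability `prob`."""
    return -math.log2(prob)

# Assumed baseline: storing 50 raw token ids at 16 bits each.
RAW_BITS_PER_TOKEN = 16
raw_bits = 50 * RAW_BITS_PER_TOKEN

for p in (0.5, 0.01):
    bits = ideal_code_bits(p)
    print(f"p={p}: {bits:.2f} bits for 50 tokens vs {raw_bits} raw bits")
```

On this view, a 50-token continuation at 50% probability costs about one bit and one at 1% costs under seven, so either threshold describes a passage the model compresses extremely well.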

  • bee_rider 15 hours ago

    That seems like a reasonably easy test to run, right? All you need is a bit of prose that was known not to have been written beforehand. Actually, the experiment could be run using the paper itself!