Comment by kelipso
Almost the entire book is in there. From the paper: given a 100-token prompt, the model assigns more than 1% probability to the exact next 50 tokens, and the 50-token continuations that clear that bar cover 91% of the book. And as the title says, the continuations it produces with more than 50% probability cover 42% of the book. I'd bet the coverage gets close to 100% as you lower the probability threshold.
Also, they went through the book in 10-token strides. It's a bit of a tortured way to reproduce the book (basically impossible to actually reproduce the book that way), but it shows that the content is in there.
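For anyone who wants to see what that measurement looks like mechanically, here is a rough sketch of the sliding-window coverage count as I read it from the description above. It is not the paper's code. The names `covered_fraction`, `continuation_prob`, and `toy_prob` are made up for illustration, and counting only the 50 continuation tokens as "covered" is my assumption. In a real run `continuation_prob` would query the model; here it is a stand-in so the snippet runs on its own.

```python
from typing import Callable, Sequence


def covered_fraction(
    tokens: Sequence[int],
    continuation_prob: Callable[[Sequence[int], Sequence[int]], float],
    threshold: float,
    prompt_len: int = 100,
    cont_len: int = 50,
    stride: int = 10,
) -> float:
    """Fraction of tokens that fall inside at least one continuation
    whose probability (given the preceding prompt) exceeds `threshold`."""
    covered = [False] * len(tokens)
    last_start = len(tokens) - prompt_len - cont_len
    # Slide a (prompt, continuation) window over the text in fixed strides.
    for start in range(0, last_start + 1, stride):
        cont_start = start + prompt_len
        cont_end = cont_start + cont_len
        prompt = tokens[start:cont_start]
        cont = tokens[cont_start:cont_end]
        if continuation_prob(prompt, cont) > threshold:
            # Assumption: only the continuation tokens count as covered.
            for i in range(cont_start, cont_end):
                covered[i] = True
    return sum(covered) / len(tokens) if tokens else 0.0


def toy_prob(prompt: Sequence[int], cont: Sequence[int]) -> float:
    """Hypothetical stand-in for a model: early text is 'well memorized',
    the middle weakly, the end not at all."""
    if cont[0] < 500:
        return 0.6
    if cont[0] < 800:
        return 0.02
    return 0.001


if __name__ == "__main__":
    book = list(range(1000))  # fake "book" of 1000 token ids
    for threshold in (0.5, 0.01):
        print(threshold, covered_fraction(book, toy_prob, threshold))
```

Lowering `threshold` can only keep or grow the covered set, which is why I'd expect the coverage number to climb as the probability bar drops.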
Now, whether this is a derivative work, a copyright violation, or whatever is debatable. You'd probably get similar numbers for a bunch of other books too. They should have done the Bible, which would probably get way higher numbers, but that won’t go viral.
I think I agree with this take. The book is in there in some sense; whether or not that is a copyright violation is debatable.
Honestly, I get why these debates happen—it is practical to establish whether or not this emerging tech is illegal under current law. But it’s also like… well, obviously current law wasn’t written with this sort of application in mind.
Whether or not we think LLMs are basically good or bad, they are clearly quite impactful. It would be a nice time to have a functional legislature to address this directly.