Comment by thomastjeffery, 18 hours ago
An LLM is not a database. No significant piece of information in a model can be retrieved reliably 100% of the time, because it is opaque to the user which collection of tokens will elicit a specific output. Getting a predictable result from an LLM even 50% of the time is very significant.
This doesn't tell us for certain whether the model was trained on a full copy of the book. It's possible that 50-token passages covering 42% of the book were, incidentally, quoted verbatim in various parts of the training data. Considering the popularity of both the book itself and its derivative fan-fiction, that wouldn't surprise me. It would surprise me even less to learn that it was indeed trained on a full copy of the book, if not several.
The more meaningful point here is that the ability to reproduce half a book is the same kind of overt derivative work that would clearly be considered copyright infringement in any other context. A lossy copy is still a copy. If we hold LLMs to the same standard as other content, this is not easy to defend.
Personally, I see this as a good opportunity to reevaluate copyright on the whole. I think we would be better off without it.