tanaros 18 hours ago

Their methodology seems reasonable to me.

To clarify, they look at the probability that a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this at every position in the book, sliding the window in 10-character (NB: not token) steps. Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.
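
FWIW, the measurement is simple enough to sketch in a few lines. This is my own rough reconstruction from that description, not the paper's code; the model name, the function names, and the 1000-character slice are placeholder assumptions:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; the paper covers several models
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    model.eval()

    def excerpt_logprob(prefix_ids, target_ids):
        # Log-probability the model assigns to the target tokens given the
        # prefix (teacher forcing; no sampling involved).
        input_ids = torch.tensor([prefix_ids + target_ids])
        with torch.no_grad():
            logits = model(input_ids).logits[0]
        logprobs = torch.log_softmax(logits, dim=-1)
        # The token at position i is predicted by the logits at position i-1.
        return sum(logprobs[i - 1, t].item()
                   for i, t in enumerate(target_ids, start=len(prefix_ids)))

    def scan_book(text, stride_chars=10, n_tokens=50):
        # Slide through the book in 10-character steps; at each offset, score
        # the 50-token continuation given the 50-token prefix.
        for start in range(0, len(text), stride_chars):
            # 1000 chars is comfortably more than 100 tokens' worth of text.
            ids = tok(text[start:start + 1000], add_special_tokens=False).input_ids
            if len(ids) < 2 * n_tokens:
                break
            prefix, target = ids[:n_tokens], ids[n_tokens:2 * n_tokens]
            yield start, excerpt_logprob(prefix, target)  # p(excerpt) = exp(logprob)

The Harry Potter result is then just that exp(logprob) comes out high at far more offsets than it does for obscure books.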

Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.

raincole 14 hours ago

> one of those tricky semantic arguments we have yet to settle when it comes to LLMs

Sure. But imagine this: In a hypothetical world where LLMs never existed, I tell you that I can recall 42 percent of the first Harry Potter book. What would you assume I can do?

It's definitely not "this guy, given any 50-token passage, can predict the next 50 tokens with better than 50% probability."

Of course, the semantics of 'recall' isn't the point of the article. The point is that Harry Potter was in the training set. But I still think it's a nothing burger. It would be very weird to assume Llama was trained only on copyright-free materials. And afaik there's no legal precedent saying that training on copyrighted material is illegal.