Comment by raincole
(Disclaimer: haven't read the original paper)
It sounds like a ridiculous way to measure this. Being able to produce 50-token excerpts absolutely doesn't translate to "recalling X percent of Harry Potter" for me.
(Edit: I read this article. A nothingburger, if its interpretation of the original paper is correct.)
Their methodology seems reasonable to me.
To clarify: they measure the probability that a model will reproduce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this at every position in the book, sliding the window forward in steps of 10 characters (NB: characters, not tokens). Excerpts from Harry Potter turn out to be substantially more likely to be reproduced than excerpts from less well-known books.
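For concreteness, here's a rough Python sketch of how that score could be computed with Hugging Face transformers. The model name, stride, and scoring details are my assumptions based on the description above, not necessarily the paper's exact setup:

```python
# Minimal sketch: score how likely a causal LM is to emit the next 50 tokens
# of a book verbatim, given the preceding 50 tokens, at 10-character strides.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-70B"  # hypothetical choice; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def excerpt_logprob(prefix_ids, target_ids):
    """Log-probability that the model emits target_ids verbatim after prefix_ids."""
    input_ids = torch.tensor([prefix_ids + target_ids])
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    # logits at position i predict the token at position i+1,
    # so the target token at position p is scored by log_probs[p - 1].
    total = 0.0
    for pos, tok in enumerate(target_ids, start=len(prefix_ids)):
        total += log_probs[pos - 1, tok].item()
    return total

def scan_book(text, stride_chars=10, n_tokens=50):
    """Slide over the raw text in 10-character steps; at each offset,
    tokenize and score the second 50 tokens given the first 50."""
    for start in range(0, len(text), stride_chars):
        ids = tokenizer(text[start:], add_special_tokens=False)["input_ids"]
        if len(ids) < 2 * n_tokens:
            break
        prefix, target = ids[:n_tokens], ids[n_tokens:2 * n_tokens]
        yield start, excerpt_logprob(prefix, target)
```

A book that is heavily memorized would show many windows with log-probabilities near zero (i.e., near-certain verbatim continuation), whereas an unmemorized book would not.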
Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.