Comment by water-data-dude a day ago
It'd be difficult to prove that you hadn't leaked information to the model. The big gotcha of LLMs is that you train them on BIG corpuses of data, which means it's hard to say "X isn't in this corpus", or "this corpus only contains Y". You could TRY to assemble a set of training data that only contains text from before a certain date, but it'd be tricky as heck to be SURE about it.
Ways data might leak into the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
There are also severe selection effects: which documents were preserved, printed, and scanned because they turned out to be on the right track toward relativity?