Comment by water-data-dude a day ago
It'd be difficult to prove that you hadn't leaked information to the model. The big gotcha of LLMs is that you train them on BIG corpuses of data, which means it's hard to say "X isn't in this corpus", or "this corpus only contains Y". You could TRY to assemble a set of training data that only contains text from before a certain date, but it'd be tricky as heck to be SURE about it.
Ways data might leak into the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
There are also severe selection effects: which documents were preserved, printed, and scanned because they turned out to be on the right track toward relativity?