Comment by jaydepun
We've thought of doing this sort of exercise at work but mostly hit the wall of data becoming a lot scarcer the further back in time you go. Particularly high-quality science data: even going pre-1970 (and that's already a stretch) you lose a lot of information. There's a triple whammy: the data still existing, it being accessible in any format, and that format being suitable for training an LLM. Then there's the complication of wanting additional model capabilities without causally leaking later data into the model.
I was wondering this. What is the minimum amount of text an LLM needs to be coherent? Fun as this idea is, the samples of its responses are basically babbling nonsense. Going further, a lot of what makes LLMs so strong isn't their original training data but the RLHF done afterwards, and RLHF would be very difficult in this case.