Comment by khalic
> Villalobos et al. [75] project that frontier LLMs will be trained on all publicly available human-generated text by 2028. We argue that this impending “data wall” will necessitate the adoption of synthetic data augmentation. Once web-scale corpora are exhausted, progress will hinge on a model’s capacity to generate its own high-utility training signal. A natural next step is to meta-train a dedicated SEAL synthetic-data generator model that produces fresh pretraining corpora, allowing future models to scale and achieve greater data efficiency without relying on additional human text.
2028 is pretty much tomorrow… fascinating insight
It's just a projection, nothing more. A single human brain is vastly more complex than the entire web in terms of nodes and the connections between them. We don't understand the brain well enough to explain how we think, or how it produces its output before that output ever lands on the web. Projecting that models will be able to create useful training data on their own once web-scale text runs out is just a guess. Such synthetic data may never match the quality of human thought; it may simply regurgitate what already exists without furthering the learning or the model's quality at all. Calling that idea an "insight" is a bit too optimistic.