Comment by bane 2 days ago
This is one of the vanguards warning of the changes coming in the post-AI world.
>> Generative AI has polluted the data
Just like low-background steel marks the break in history from before and after the nuclear age, these types of data mark the distinction from before and after AI.
Future models will continue to amplify certain statistical properties of their training data, and that amplified data will go on polluting the public space from which future training data is drawn. Meanwhile, certain low-frequency data will be selected by these models less and less, becoming suppressed and possibly eliminated. We know from classic NLP techniques that low-frequency words are often among the highest in information content and descriptive power.
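That last point is the intuition behind inverse document frequency (IDF) weighting in classic NLP: a word that appears everywhere carries almost no information, while a rare word pins down a document precisely. A minimal sketch (the toy corpus and `idf` helper here are purely illustrative, not from the comment):

```python
import math

# Toy corpus: "the" appears in every document, "sesquipedalian" in only one.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the sesquipedalian lexicographer slept",
]

def idf(term, docs):
    """Inverse document frequency: log(N / document frequency).
    Ubiquitous terms score 0; rare terms score high."""
    df = sum(term in doc.split() for doc in docs)
    return math.log(len(docs) / df)

print(idf("the", docs))             # in all 3 docs -> log(3/3) = 0.0
print(idf("sesquipedalian", docs))  # in 1 doc -> log(3/1) ≈ 1.10
```

If generative models systematically under-sample rare terms, it is exactly these high-IDF words that drop out of future training corpora first.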
Bitrot will continue to act as the agent of entropy, further eroding pre-AI datasets.
These feedback loops will persist, language will be ground down, neologisms will be prevented, and society, no longer possessing the mental tools to describe changing circumstances, its new thoughts unable to be realized, will cease to advance and then regress.
Soon there will be no new low-frequency ideas being removed from the data, only old ones. Language's descriptive power is further eliminated, and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can produce only unintelligible pages of particles and articles; language is lost, and civilization is lost when we no longer know what to call its downfall.
The glimmer of hope is that humanity figured out how to rise from the dreamstate of the world of animals once. Future humans will be able to climb from the ashes again. There used to be a word, the name of a bird, that encoded this ability to die and return again, but that name is already lost to the machines that will take our tongues.
> Future models will continue to amplify certain statistical properties of their training data, and that amplified data will go on polluting the public space from which future training data is drawn.
That's why on FB I mark my own writing as AI-generated, and the AI-generated slop as genuine. Because what is disguised as a "transparency disclaimer" is really just a flag marking which content is a potential training dataset and which isn't.