Comment by weinzierl 2 days ago
Isn't it the other way around?
SEO text, carefully tuned to tf-idf metrics and keyword-stuffed up to the empirically determined threshold Google just barely allows, should have unnatural word frequencies.
LLM content should just reinforce and cement the status-quo word frequencies.
Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
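The frequency-outlier idea could be sketched roughly like this: compare how often a word appears in a text against its baseline frequency in human writing, and flag words that are far over. This is just a toy illustration; the `BASELINE_FREQ` numbers are made up, and a real baseline would have to come from a large pre-2022 corpus.

```python
from collections import Counter

# Made-up baseline frequencies for illustration only; real values
# would be estimated from a large human-written (pre-2022) corpus.
BASELINE_FREQ = {"delve": 0.00002, "the": 0.05, "explore": 0.0003}

def flag_unnatural_words(text: str, threshold: float = 5.0) -> dict:
    """Return words whose observed frequency exceeds `threshold`
    times their baseline frequency, mapped to that ratio."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    flagged = {}
    for word, base in BASELINE_FREQ.items():
        observed = counts[word] / total if total else 0.0
        if observed > threshold * base:
            flagged[word] = observed / base
    return flagged

# A text that leans heavily on "delve" gets flagged; ordinary
# words at normal rates do not.
print(flag_unnatural_words("we delve and delve into topics again and again today ok"))
```

A sentinel word like "delve" would work the same way in reverse: plant a rare word at a known rate, then watch whether downstream text reproduces it more often than the baseline predicts.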
But you can already see it with "delve". Mistral uses "delve" more than the baseline, because it was trained on GPT output.
So it's a classic positive feedback loop: the LLM uses "delve" more, "delve" appears in the training data more, the LLM uses "delve" more...
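The feedback loop above can be simulated with a toy model (all the numbers here are assumptions for illustration): each model generation trains on a mix of a fixed human baseline and the previous generation's output, and the model slightly over-produces the word relative to whatever it saw in training.

```python
# Toy feedback-loop simulation; HUMAN_FREQ, MODEL_BIAS, and
# SYNTHETIC_SHARE are made-up parameters, not measured values.
HUMAN_FREQ = 0.001      # word frequency in human-written text
MODEL_BIAS = 1.5        # model over-produces the word by this factor
SYNTHETIC_SHARE = 0.5   # fraction of training data that is model output

def next_generation_freq(prev_model_freq: float) -> float:
    # The training corpus mixes human text with the previous
    # generation's output...
    training_freq = ((1 - SYNTHETIC_SHARE) * HUMAN_FREQ
                     + SYNTHETIC_SHARE * prev_model_freq)
    # ...and the new model reproduces that frequency, amplified.
    return MODEL_BIAS * training_freq

freq = HUMAN_FREQ
for gen in range(10):
    freq = next_generation_freq(freq)
    print(f"gen {gen}: {freq:.6f}")
```

With these particular numbers the frequency converges to 3x the human baseline (the fixed point of the recurrence); if the amplification times the synthetic share ever reaches 1, it diverges instead of settling, which is the runaway case.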
Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).
Unfortunately, I think human- and machine-generated text are entirely miscible. There is no "baseline" outside the machines, other than pre-2022 text. Like low-background, pre-atomic steel.