Comment by PeterStuer 2 days ago
Intuitively I feel like word frequency would be one of the things least impacted by LLM output, no?
Yes, but the material presented in no way distinguishes between potential organic growth of 'delve' and LLM-induced use. They just note that although 'delve' was already on the rise, the word gained more popularity in 2023-24, at the same time ChatGPT took off. Word adoption is certainly not a linear phenomenon. And as the author states: 'I don't think anyone has reliable information about post-2021 language usage by humans.'
So I would still say that noun-phrase frequency in LLM output would tend to reflect noun-phrase frequency in the training data in a similar context (disregarding, for the moment, bias induced through RLHF and other tuning).
I'm sure there will be cross-fertilization from LLM to human and back, but I'm not yet seeing data showing that the influence on word frequency is that pronounced.
The author seems to have some other objections to the rise of LLMs, which I fully understand.
The fact that making this distinction is impossible is reason enough to stop.
Even granting that we can disregard a really huge factor here, which I'm not sure we really can, one cannot know beforehand how the clustering of the vocabulary is going to go pre-training, and it's speculated that both at the center and at the edges of clusters we get random particularities. Hence the "SolidGoldMagikarp" phenomenon and many others.
There is almost certainly organic growth as well, as more people in Nigeria and other SSA countries have been getting very good internet access in recent years.
Think of an LLM as a person on the internet. Just like everyone else, they have their own vocabulary and preferred way of talking which means they’ll use some words more than others. Now imagine we duplicate this hypothetical person an incredible amount of times and have their clones chatter on the internet frequently. ‘Certainly’ this would have an effect.
Yes but this person learned to mimic the internet at large. Theoretically its preferred way of talking would be the average of all training data, as mimicry is GPT's training objective, and would therefore have very similar word distributions. Only, this doesn't account for RLHF and prompts spreading memetically among users.
> Theoretically its preferred way of talking would be the average of all training data
This is incorrect. Furthermore, what the LLM says is also determined by what its user wants it to say, and how frequently the user wants the LLM to post on the internet. This will have a large effect on the internet’s word frequency distribution.
If only we had a data set that measured word frequency across the internet as we're getting more and more into AI being used... Maybe with a baseline from before 2021 for comparison... But no let's just stop measuring word frequency entirely because we can just assume what will happen and we're angry.
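For what it's worth, the comparison described above (a word's frequency in recent text against a pre-2021 baseline) can be sketched in a few lines. This is only an illustration: the two strings are made-up toy corpora, and `per_million` and `frequency_ratio` are hypothetical helper names, not anything taken from the article or its data set.

```python
import re
from collections import Counter


def per_million(text: str, word: str) -> float:
    """Occurrences of `word` per million tokens of `text` (naive tokenizer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] * 1_000_000 / len(tokens)


def frequency_ratio(baseline: str, recent: str, word: str) -> float:
    """How many times more frequent `word` is in `recent` than in `baseline`."""
    base = per_million(baseline, word)
    if base == 0:
        raise ValueError("word does not occur in the baseline corpus")
    return per_million(recent, word) / base


# Toy corpora, invented purely for illustration (10 tokens each).
baseline_text = "we delve into the data and look at the trend"
recent_text = "we delve into the data and delve into the trend"

ratio = frequency_ratio(baseline_text, recent_text, "delve")
print(f"'delve' is {ratio:.1f}x as frequent in the recent sample")
```

Note that a ratio like this only says the frequency changed. It does not separate organic growth from LLM-induced use, which is exactly the distinction being argued about upthread.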
In fact it'd be quite the opposite. There comes a turning point where the majority of language usage would actually be written by AI, at which point we'd no longer be analysing word frequency and usage by actual humans, so it wouldn't be representative of how humans actually communicate.
Or, potentially even more dystopian, AI slop would be dictating and driving human communication going forward.