oneeyedpigeon 2 days ago

I wonder if anyone will fork the project. Apart from anything else, the data may still be useful given that we know it is polluted. In fact, it could act as a means of judging the impact of LLMs via that very pollution.

  • Miraltar 2 days ago

    I guess it would be interesting, but differentiating pollution from language evolution seems very tricky, since getting a non-polluted corpus gets harder and harder

    • Retr0id 2 days ago

      Arguably it is a form of language evolution. I bet humans have started using "delve" more too, on average. I think the best we can do is look at the trends and think about potential causes.

      • rvnx 2 days ago

        “Seamless”, “honed”, “unparalleled”, “delve” are now polluting the landscape because of monkeys repeating what ChatGPT says without even questioning what the words mean.

        Everything is “seamless” nowadays. Like I am seamlessly commenting here.

        Arguably, the meaning of these words evolves due to misuse too.

      • pavel_lishin 2 days ago

        > I bet humans have started using "delve" more too, on average.

        I wish there were a way to check.

        • linhns 7 hours ago

          I'm seeing more and more uses of it on this thread.

    • wpietri 2 days ago

      One way to tackle it would be to use LLMs to generate synthetic corpuses, so you have some good fingerprints for pollution. But even there I'm not sure how doable that is given the speed at which LLMs are being updated. Even if I know a particular page was created in, say, January 2023, I may no longer be able to try to generate something similar now to see how suspect it is, because the precise setups of the moment may no longer be available.

greentxt 2 days ago

I think this person has too high a view of pre-2021, probably for ego reasons. In fact, their attitude seems very ego-driven. AI didn't just occur in 2021. Nobody knows how much text was machine-generated prior to 2021; it was much harder, if not impossible, to detect. If anything, it's probably easier now, since people are all using the same AIs that use words like "delve" so much that it becomes obvious.

  • croes 2 days ago

    >AI didn't just occur in 2021. Nobody knows how much text was machine generated prior to 2021

    But we do know that now it's a lot more, with a capital LOT.

    • greentxt 2 days ago

      I assume you are correct but how can we know rather than assume? I am not sure we can, so why get worked up about "internet died in 2021" when many would claim with similar conviction that it's been dead since 2012, or 2007, or ...

      • ClassyJacket 2 days ago

        You are making a claim that somehow someone was sitting on something as powerful as ChatGPT, long before ChatGPT, and that it was in widespread use, secretly, without even a single leak by anyone at any point. That's not plausible.

        • nlpparty a day ago

          Twitter has been accused of being full of bots long before ChatGPT appeared. For 140 characters, a template with synonyms would be enough to create mass-generated content.

miguno 2 days ago

I have been noticing this trend increasingly myself. It's getting more and more difficult to use tools like Google search to find relevant content.

Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if internet humanity is already in retreat to its last homely houses, which are more closed than open to the outside.

On the positive side:

1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.

2. As a result of (1), I have started to use an RSS reader again. Who would have thought!

I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.

jgord a day ago

We will soon face another kind of bit-rot : where so much text is generated by LLMs that it pollutes the human natural language corpus available for training, on the web.

Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.

But I hazard a guess this won't happen, as it's a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.

  • ipaddr a day ago

    Old content can make money now and will be more valuable, so why wouldn't it happen more frequently?

jchook 2 days ago

If it is (apparently) easy for humans to tell when content is AI-generated slop, then it should be possible to develop an AI to distinguish human-created content.

As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.
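A crude version of the perplexity heuristic mentioned above can be sketched with a smoothed unigram model. This is a toy illustration, not a production detector; the tiny corpora here are invented for the example:

```python
import math
from collections import Counter

def unigram_model(corpus_words, smoothing=1.0):
    """Build a Laplace-smoothed unigram probability model from a word list."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    vocab = len(counts)
    def prob(word):
        # Smoothing gives unseen words a small nonzero probability
        return (counts[word] + smoothing) / (total + smoothing * (vocab + 1))
    return prob

def perplexity(words, prob):
    """Exponentiated average negative log-probability: lower = more 'expected'."""
    log_sum = sum(-math.log(prob(w)) for w in words)
    return math.exp(log_sum / len(words))

reference = "the cat sat on the mat and the dog sat on the rug".split()
prob = unigram_model(reference)

typical = "the cat sat on the rug".split()
unusual = "quantum delve seamlessly honed".split()

# Text resembling the reference corpus scores lower perplexity
assert perplexity(typical, prob) < perplexity(unusual, prob)
```

Real perplexity-based detectors score text under a full language model rather than unigrams, but the principle is the same: unexpectedly "smooth" or unexpectedly alien distributions both stand out.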

  • aDyslecticCrow 2 days ago

    > If it is (apparently) easy for humans to tell when content is AI-generated slop

    It's actually not. It's rather difficult for humans as well. We can see verbose text that is confused and call it AI, but it could just be a human as well.

    To borrow an older model training method, "Generative adversarial network". If we can distinguish AI from humans... We can use it to improve AI and close the gap.

    So, it becomes an arms race that constantly evolves.

sashank_1509 2 days ago

Not to be too dismissive, but is there a worthwhile direction of research to pursue in NLP that is not LLMs?

If we add linguistics to NLP I can see an argument, but if we define NLP as the research of enabling a computer to process language, then it seems to me that LLMs/generative AI is the only research an NLP practitioner should focus on, and everything else is moot. Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep learning model on a lot of data?

  • sinkasapa 2 days ago

    Maybe it is "including linguistics", but most of the world's languages don't have the data available to train on. So I think one major question for NLP is exactly the question you posed: "Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep learning model on a lot of data?"

aucisson_masque 2 days ago

It could be used to spot LLM generated text.

Compare the frequency of words to those used in natural human writing and you spot the computer from the human.
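A toy sketch of that idea, with invented baseline rates standing in for real per-million frequencies from a pre-LLM corpus (the marker words and numbers here are assumptions for illustration only):

```python
from collections import Counter

# Hypothetical per-million baseline rates for a few "LLM marker" words;
# real rates would come from a pre-2021 corpus such as wordfreq's data.
HUMAN_RATE_PER_MILLION = {"delve": 2.0, "seamless": 5.0, "unparalleled": 3.0}

def marker_score(text):
    """Sum of observed/expected ratios for marker words; higher = more LLM-like."""
    words = text.lower().split()
    counts = Counter(words)
    per_million = 1_000_000 / max(len(words), 1)
    score = 0.0
    for word, baseline in HUMAN_RATE_PER_MILLION.items():
        observed = counts[word] * per_million
        score += observed / baseline
    return score

plain = "we compared the two systems and found one faster"
floral = "we delve into the seamless and unparalleled results"

assert marker_score(floral) > marker_score(plain)
```

A serious test would use a proper statistic (chi-square or log-likelihood over many words) rather than a handful of markers, and, as replies below note, the baseline itself drifts as humans pick up LLM vocabulary.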

  • Lvl999Noob 2 days ago

    It could be used to differentiate LLM text from pre-LLM human text, maybe. The thing is, our AIs may not be very good at learning, but our brains are. The more we use AI, and the more we integrate LLMs and other tools into our lives, the more their output will influence us. I believe there was a study (or a few anecdotes) where college papers checked for AI material were marked AI-written even though they were written by humans, because the students had used AI during their studying and learned from it.

    • MPSimmons 2 days ago

      You're exactly right. You only have to look at the prevalence of the word "unalive" in real life contexts to find an example.

    • thfuran 2 days ago

      >our AIs may not be very good at learning but our brains are

      Brains aren't nearly as good at slightly adjusting the statistical properties of a text corpus as computers are.

    • left-struck 2 days ago

      > The more we use AI, the more we integrate LLMs and other tools into our life, the more their output will influence us

      Hmm, I don’t disagree, but I think it will be a valuable skill going forward to write text that doesn’t read like it was written by an LLM

      This is an arms race that I’m not sure we can win though. It’s almost like a GAN.

  • ithkuil 2 days ago

    It may work for a short time, but after a while natural language will evolve due to natural exposure to those new words or word patterns, and even humans will write in ways that, while being different from the LLMs, will also be different from the snapshot captured by this dataset. It's already the case that we wrote differently 20 years ago than 50 years ago, and even more so 100 years ago, etc.

  • slashdave 2 days ago

    Hardly. You are talking about a statistical test, which will have rather large errors (since it is based on word frequencies). Not to mention word frequencies will vary depending on the type of text (essay, description, advertisement, etc).

  • TacticalCoder 2 days ago

    > ... compare the frequency of words to those used in human natural writings and you spot the computer from the human.

    But that's a losing endeavor: if you can do that, you can immediately ask your LLM to fix its output so that it passes that test (and many others). It can introduce typos, make small errors on purpose, and anything you can think of to make it look human.

karaterobot 2 days ago

I guess a manageable, still-useful alternative would be to curate a whitelist of sources that don't use AI, and, without making that list public, derive the word frequencies from only those sources. How to compile that list is left as an exercise for the reader. The result would not be as accurate as a broad sample of the web, but in a world where it's impossible to trust a broad sample of the web, it's the option you are left with. And I have no reason to doubt that it could be done at a useful scale.
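The mechanics of that approach are simple once the (hard) curation is done; a minimal sketch, with hypothetical domain names standing in for a real vetted whitelist:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical whitelist of domains believed (not guaranteed) to be human-written
WHITELIST = {"example-blog.org", "small-forum.net"}

def whitelist_frequencies(pages):
    """pages: iterable of (url, text) pairs; count words only from whitelisted domains."""
    counts = Counter()
    for url, text in pages:
        domain = urlparse(url).netloc
        if domain in WHITELIST:
            counts.update(text.lower().split())
    return counts

pages = [
    ("https://example-blog.org/post", "the quick brown fox"),
    ("https://contentfarm.biz/page", "delve delve delve"),
]
freqs = whitelist_frequencies(pages)
assert freqs["delve"] == 0 and freqs["fox"] == 1
```

The counting is trivial; all the difficulty lives in deciding which domains belong in the set, which is exactly the exercise left to the reader.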

I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.

  • WaitWaitWha 2 days ago

    > curate a whitelist of sources that don't use AI,

    I like this.

    Maybe even take it a step further - have a badge on the source that is both human and machine visible to indicate that the content is not AI generated.

PeterStuer 2 days ago

Intuitively I feel like word frequency would be one of the things least impacted by LLM output, no?

  • Jcampuzano2 2 days ago

    It'd in fact be quite the opposite. There comes a turning point where the majority of language usage would actually be written by AI, at which point we'd no longer be analysing the word frequency/usage of actual humans, and so it wouldn't be representative of how humans actually communicate.

    Or potentially even more dystopian would be that AI slop would be dictating/driving human communication going forward.

  • baq 2 days ago

    ‘delve’ is given as an example right there in TFA.

    • PeterStuer 2 days ago

      Yes, but the material presented in no way makes a distinction between potential organic growth of 'delve' vs. LLM-induced use. They just note that even though 'delve' was already on the rise, in 23-24 the word gained more popularity at the same time ChatGPT rose. Word adoption is certainly not a linear phenomenon. And as the author states, 'I don't think anyone has reliable information about post-2021 language usage by humans'

      So I would still state that noun-phrase frequency in LLM output tends to reflect noun-phrase frequency in the training data in a similar context (disregarding, for the moment, enforced bias induced through RLHF and other tuning)

      I'm sure there will be cross-fertilization from LLM to human and back, but I'm not seeing the data yet that the influence on word frequency is that pronounced.

      The author seems to have some other objections to the rise of LLMs, which I fully understand.

      • QuiDortDine 2 days ago

        The fact that making this distinction is impossible is reason enough to stop.

      • beepbooptheory 2 days ago

        Even granting that we can disregard a really huge factor here, which I'm not sure we really can, one cannot know beforehand how the clustering of the vocabulary will go pre-training, and it's speculated that both at the center and at the edges of clusters we get random particularities. Hence the "solidgoldmagikarp" phenomenon and many others.

    • whimsicalism 2 days ago

      there is almost certainly organic growth as well, as more people in Nigeria and other SSA countries have gained very good internet penetration in recent years

  • joshdavham 2 days ago

    Think of an LLM as a person on the internet. Just like everyone else, they have their own vocabulary and preferred way of talking which means they’ll use some words more than others. Now imagine we duplicate this hypothetical person an incredible amount of times and have their clones chatter on the internet frequently. ‘Certainly’ this would have an effect.

    • efskap a day ago

      Yes but this person learned to mimic the internet at large. Theoretically its preferred way of talking would be the average of all training data, as mimicry is GPT's training objective, and would therefore have very similar word distributions. Only, this doesn't account for RLHF and prompts spreading memetically among users.

      • joshdavham 21 hours ago

        > Theoretically its preferred way of talking would be the average of all training data

        This is incorrect. Furthermore, what the LLM says is also determined by what its user wants it to say, and how frequently the user wants the LLM to post on the internet. This will have a large effect on the internet’s word frequency distribution.

  • cdrini a day ago

    If only we had a data set that measured word frequency across the internet as we're getting more and more into AI being used... Maybe with a baseline from before 2021 for comparison... But no let's just stop measuring word frequency entirely because we can just assume what will happen and we're angry.

charlieyu1 2 days ago

The web before 2021 was still polluted by content farms. The articles were written by humans, but still, they were rubbish. Not comparable to the current rate of generation, but the web was already dominated by them.

  • devjab a day ago

    Maybe, but if you're studying the way humans use language you're still getting human-made data from rubbish. There isn't any value in AI-generated content if what you're cataloging is human language.

altcognito 2 days ago

It might be fun to collect the same data, if for no other reason than to note the changes, but adding the caveat that it doesn't represent human output.

Might even change the tool name.

  • jpjoi 2 days ago

    The point was it’s getting harder and harder to do that as things get locked down or go behind a massive paywall, either to profit off of or to avoid being used in generative AI. The places where previous versions got data are impossible to gather from anymore, so the dataset you would collect would be completely different, which (might) cause weird skewing.

    • oneeyedpigeon 2 days ago

      But that would always be the case. Twitter will not last forever; heck, it may not even be long before an open alternative like Bluesky competes with it. Would be interesting to know what percentage of the original mined data was from Twitter.

jadayesnaamsi 2 days ago

The year 2021 is to wordfreq what 1945 was to carbon-14 dating.

I guess the same way the scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non human content.

Saying magic because, unfortunately, it was much easier to detect nuclear testing in the atmosphere than it will be to detect AI-generated content.

ilaksh 2 days ago

Reading through this entire thread, I suspect that somehow generative AI actually became a political issue. Polarized politics is like a vortex sucking all kinds of unrelated things in.

In case that doesn't get my comment completely buried, I will go ahead and say honestly that even though "AI slop" and paywalled content are a problem, I don't think that generative AI in itself is a negative at all. And I also think that part of this person's reaction is that LLMs have made previous NLP techniques, such as those based on simple usage counts etc., largely irrelevant.

What was/is wordfreq used for, and can those tasks not actually be done more effectively with a cutting edge language model of some sort these days? Maybe even a really small one for some things.

  • ecshafer 2 days ago

    Generative AI is inherently a political issue; it's not surprising at all.

    There is the case of what is "truth". As soon as you start to ensure some quality of truth to what is generated, that is political.

    As soon as generative AI has the capability to take someone's job, that is political.

    The instant AI can make someone money, it is political.

    When AI is trained on something that someone has created, and now they can generate something similar, it is political.

    • whimsicalism 2 days ago

      > As soon as generative AI has the capability to take someone's job, that is political.

      What is political is people enshrining themselves in chokepoints and demanding a toll for passing through or getting anything done. That is what you do when you make a certain job politically 'untakable'.

      People who espouse that the 'personal is political' risk making the definition of politics so broad that it is useless.

    • ilaksh 2 days ago

      Then .. everything is political?

      • commodoreboxer 2 days ago

        Everything involving any kind of coordination, cooperation, competition, and/or communication between two or more people involves politics by its very nature. LLMs are communication tools. You can't divorce politics from their use when one person is generating text for another person to read.

      • JohnFen 2 days ago

        "Just because you do not take an interest in politics doesn't mean politics won't take an interest in you." -- Pericles

  • rincebrain 2 days ago

    The simplest example that comes to mind of something frequency analysis might be useful for would be if you had simple ciphertext where you knew that the characters probably mapped 1:1, but you didn't know anything about how.
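    A toy sketch of that first step: rank ciphertext letters by frequency and map them onto English letters ranked the same way. This is only a crude first guess that real attacks refine with bigram statistics and trial decryption:

```python
from collections import Counter

# English letters ordered by typical frequency (a common approximation)
ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

def guess_substitution(ciphertext):
    """Map each cipher letter to an English letter of matching frequency rank."""
    letters = [c for c in ciphertext.lower() if c.isalpha()]
    ranked = [letter for letter, _ in Counter(letters).most_common()]
    return {cipher: plain for cipher, plain in zip(ranked, ENGLISH_BY_FREQ)}

# The most frequent ciphertext letter gets mapped to 'e'
mapping = guess_substitution("xqxx yx zxy")
assert mapping["x"] == "e"
```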

    It could also be useful for guessing whether someone might have been trying to do some kind of steganographic or additional encoding in their work, by telling you how abnormal it is, compared to how most people write, that someone happened to choose a very unusual construction, or whether it's unlikely that two people chose the same unusual construction by coincidence rather than plagiarism.

    You might also find statistical models interesting for things like noticing patterns in people for whom English or another language is not their first language, and when they choose different constructions more often than speakers for whom it is their first language.

    I'm not saying you can't use an LLM to do some or all of these, but they also have something of a scalar attached to them of how unusual the conclusion is - e.g. "I have never seen this construction of words in 50 million lines of text" versus "Yes, that's natural.", which can be useful for trying to inform how close to the noise floor the answer is, even ignoring the prospect of hallucinations.

  • whimsicalism 2 days ago

    Yes, it's become extremely politicized and it's very tiresome. Tech in general, to be frank. Pray that your field of interest never gets covered in the NYT.

donatj 2 days ago

I hear this complaint often, but in reality I have encountered fairly little content in my day-to-day that has felt fully AI-generated. AI-assisted, sure, but is that a problem if a human is in the mix, curating?

I certainly have not encountered enough straight drivel where I would think it would have a significant effect on overall word statistics.

I suspect there may be some over-identification of AI content happening, a sort of Baader–Meinhof effect cognitive bias. People have their eye out for it and suddenly everything that reads a little weird logically "must be AI generated" and isn't just a bad human writer.

Maybe I am biased: about a decade ago I worked for an SEO company with a team of copywriters who pumped out mountains of the most inane keyword-packed text, designed for literally no one but Google to read. It would rot your brain if you tried to read it, and it was written by hand by a team of human beings. This existed WELL before generative AI.

  • pavel_lishin 2 days ago

    > I hear this complaint often but in reality I have encountered fairly little content in my day to day that has felt fully AI generated?

    How confident are you in this assessment?

    > straight drivel

    We're past the point where what AI generates is "straight drivel"; every minute, it's harder to distinguish AI output from human output unless you're approaching expertise in the subject being written about.

    > a team of copywriters who pumped out mountains the most inane keyword packed text designed for literally no one but Google to read.

    And now a machine can generate the same amount of output in 30 seconds. Scale matters.

    • PhunkyPhil 2 days ago

      > every minute, it's harder to distinguish AI output from actual output unless you're approaching expertise in the subject being written about.

      So, then what really is the problem with just including LLM-generated text in wordfreq?

      If quirky word distributions will remain a "problem", then I'd bet that human distributions for those words will follow shortly after (people are very quick to change their speech based on their environment, it's why language can change so quickly).

      Why not just own the fact that LLMs are going to be affecting our speech?

      • pavel_lishin a day ago

        > So, then what really is the problem with just including LLM-generated text in wordfreq?

        > Why not just own the fact that LLMs are going to be affecting our speech?

        The problem is that we cannot tell what's a result of LLMs affecting our speech, and what's just the output of LLMs.

        If LLMs result in a 10% increase of the word "gimple" online, which then results in a 1% increase of humans using the word "gimple" online, how do we measure that? Simply continuing to use the web to update wordfreq would show a 10% increase, which is incorrect.
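        The confound can be put in one line: the observed rate is a blend of two unknowns (LLM share and human usage), so very different mixtures are indistinguishable from a single corpus measurement. A toy illustration, with invented numbers:

```python
def observed_rate(llm_share, llm_rate, human_rate):
    """Blended per-million rate of a word when llm_share of the corpus is LLM output."""
    return llm_share * llm_rate + (1 - llm_share) * human_rate

# Two very different worlds...
world_a = observed_rate(llm_share=0.10, llm_rate=50.0, human_rate=5.0)   # humans unchanged
world_b = observed_rate(llm_share=0.05, llm_rate=50.0, human_rate=7.37)  # humans shifted

# ...can yield (nearly) the same corpus-level measurement
assert abs(world_a - world_b) < 0.01
```

One equation, two unknowns: without an independent estimate of either the LLM share or the human rate, the corpus measurement alone can't separate them.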

diggan 2 days ago

One of the examples is the increased usage of "delve" which Google Trends confirms increased in usage since 2022 (initial ChatGPT release): https://trends.google.com/trends/explore?date=all&q=delve&hl...

It seems, however, that it started increasing most in usage just these last few months; maybe people are talking more about "delve" specifically because of the increase in usage? A usage recursion of sorts.

  • bee_rider 2 days ago

    We’ve seen this with a couple of words and expressions, and I don’t doubt that AI is somewhat likely to “like” some phrases for whatever reason. Big eigenvalues of the latent space or whatever, hahaha (I don’t know AI).

    But also, words and phrases do become popular among humans, right? It would be a shame if AI caused the language to stagnate, as keeping up with which phrases are popular gets you labeled as an AI.

    • cdrini a day ago

      Exactly, like how "mindful" and "demure" recently became more popular for seemingly no reason. Humans do this all the time.

      And language in general stagnates and shrinks in vocabulary over time (https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s...). (Link that ChatGPT helped me find :P) I think AI will increase the average person's vocabulary, since it appears in general to be better/more professionally written than a lot of what the average person is exposed to online.

  • bongodongobob 2 days ago

    Delves are a new thing in World of Warcraft released 9/10 this year. Delve is also an M365 product that has been around for some time and is being discontinued in December. So no, that has nothing to do with LLMs.

    • _proofs 2 days ago

      Delve was also an addition to PoE, which I imagine had its own spike in google searches relative to that word.

  • nlpparty a day ago

    If you select only the USA, the trend disappears.

hcks 2 days ago

Okay, but how big a sample size do we actually need for word frequencies? What's the goal here? It looks like the initial project isn't even stratified per year/decade.

nlpparty 2 days ago

It's just inevitable. Imagine a world where we get a cheap and accessible AGI. Most work in the world will be done by it. Certainly, it will organise the work the way it finds preferable. Humans (and other AIs) will find it much harder to learn from example, as most of the work is performed in the same uniform way. The AI revolution should start with the field closest to its roots.

nlpparty a day ago

https://trends.google.com/trends/explore?date=all&geo=US&q=d...

The funny fact: it doesn't result in an increase in searches for "delve".

  • 1d22a a day ago

    That chart shows people searching for the word "delve", and isn't (directly) related to the incidence of words in content on the open web.

    • nlpparty a day ago

      I just assumed that if many people, especially less proficient language users, encounter this word in text generated by ChatGPT, they would look it up.

jhack a day ago

Kind of weird to believe “slop” didn’t exist on the internet in mass quantities before AI.

avazhi a day ago

I agree with the general ethos of the piece (albeit a few of the details are puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel, nor does what Reddit is doing make much intellectual as opposed to economic [IPO-influenced] sense) - but this line:

'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'

really does betray some real naivete. OpenAI and Google could literally burn $10 million per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with either by an army of employees manually cleaning up the data, or by algorithms Google has its own LLM write for itself.

I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).

  • devjab a day ago

    > puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel

    Maybe this is because I’m European, but what is partisan about calling X invariably worthless drivel? It seems a lot like fact to me, considering what has been going on with the platform's moderation since Elon Musk bought it. It’s so bad that the EU considers it a platform for misinformation these days.

    • cdrini a day ago

      Do you have a citation on that last claim?

    • avazhi a day ago

      Because the author specifically mentioned that it's worthless because it's 'right-wing' (a 'right-wing cesspool'), as if there aren't plenty of people espousing left-wing views on the platform. The right-wing comment in particular is what makes the statement blatantly partisan.

    • bakugo a day ago

      > It’s so bad that the EU consider it a platform for misinformation these days.

      Can you define "misinformation"? Is it just things the government disagrees with?

jedberg 2 days ago

We need a vintage data/handmade data service. A service that can provide text and images for training that are guaranteed to have either been produced by a human or produced before 2021.

Someone should start scanning all those microfiche archives in local libraries and sell the data.

WalterBright a day ago

I've wondered from time to time why I collect history books and keep my encyclopedias when I could just google things. Now I know why: they predate AI and are unpolluted by generated bilge.

zaik 2 days ago

If generative AI has a significantly different word frequency from humans, then it also shouldn't be hard to detect text written by generative AI. However, my last information is that tools to detect text written by generative AI are not that great.

DebtDeflation 2 days ago

Enshittification is accelerating. A good 70% of my Facebook feed is now obviously AI-generated images with AI-generated text blurbs that have nothing to do with the accompanying images, likely posted by overseas bot farms. I'm also noticing more and more "books" on Amazon that are clearly AI-generated and self-published.

  • janice1999 2 days ago

    It's okay. Amazon has limited authors to self publishing only 3 books per day (yes, really). That will surely solve the problem.

    • wpietri 2 days ago

      Hah! I'm trying to figure out the exact date that crossed from "plausible line from a Stross or Sterling novel" [1] to "of course they did".

      [1] Or maybe Sheckley or Lem, now that I think about it.

    • Drakim 2 days ago

      I read that as 3 books per year at first and thought to myself that that was a rather harsh limitation, but surely any truly respectable author wouldn't be spitting out more than that...

      ...and then I realized you wrote 3 books a day. What the hell.

  • Sohcahtoa82 2 days ago

    > A good 70% of my Facebook feed is now obviously AI generated images with AI generated text blurbs that have nothing to do with the accompanying images likely posted by overseas bot farms.

    This is a self-inflicted problem, IMO.

    Do you just have shitty friends that share all that crap? Or are you following shitty pages?

    I use Facebook a decent amount, and I don't suffer from what you're complaining about. Your feed is made of what you make it. Unfollow the pages that make that crap. If you have friends that share it, consider unfriending or at the very least, unfollowing. Or just block the specific pages they're sharing posts from.

jijojohnxx a day ago

Sad to see wordfreq halted, it was a real party for linguistics enthusiasts. For those seeking new tools, keep expanding your knowledge with socialsignalai.

ok123456 2 days ago

Most of the "random" bot content pre-2021 was low-quality Markov-generated text. If anything, these generative AI tools would improve the accuracy of scraping large corpora of text from the web.
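For context, that kind of Markov-chain spam was trivial to produce; a minimal bigram-chain sketch (toy corpus invented for illustration):

```python
import random
from collections import defaultdict

def build_chain(words):
    """Bigram Markov chain: map each word to the list of words that followed it."""
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length, seed=0):
    """Walk the chain from a start word, picking followers at random."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "buy cheap pills online buy cheap watches online today".split()
chain = build_chain(corpus)
text = generate(chain, "buy", 5)
assert text.startswith("buy cheap")
```

Output like this is locally plausible but globally incoherent, which is why such text was comparatively easy to flag; LLM output lacks those obvious statistical tells.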

yarg a day ago

Generative AI has done to human speech analysis what atmospheric testing did to carbon dating.

jonas21 2 days ago

I think the main reason for sunsetting the project is hinted at near the bottom:

> The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money.

Traditional NLP has been surpassed by transformers, making this project obsolete. The rest of the post reads like rationalization and sour grapes.

  • rovr138 a day ago

    I think the reason to sunset the project is actually near the top.

    > I don't think anyone has reliable information about post-2021 language usage by humans.

    It's information about language usage by humans. We know generated text has exploded after 2021. How do we filter this to only have data from humans?

    The bottom is just lamenting what's happening in the field (which is pretty much what everyone that's been doing anything with NLP research is also complaining about behind closed doors).

tqi 2 days ago

"Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable."

How sure can we be about that?

iamnotsure 2 days ago

"Multi-script languages

Two of the languages we support, Serbian and Chinese, are written in multiple scripts. To avoid spurious differences in word frequencies, we automatically transliterate the characters in these languages when looking up their words.

Serbian text written in Cyrillic letters is automatically converted to Latin letters, using standard Serbian transliteration, when the requested language is sr or sh."

I'd support keeping both scripts (српска ћирилица and latin script) , similarly to hiragana (ひらがな) and katakana (カタカナ) in Japanese.
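
The transliteration wordfreq describes is essentially a character mapping. A minimal sketch of standard Serbian Cyrillic-to-Latin romanization (this is not wordfreq's actual code; note the digraphs lj, nj, dž, and that a full version would also handle uppercase):

```python
# Standard Serbian Cyrillic -> Gaj's Latin alphabet (lowercase only).
SR_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def sr_latin(word):
    """Transliterate Serbian Cyrillic to Latin; non-Cyrillic passes through."""
    return "".join(SR_CYR_TO_LAT.get(ch, ch) for ch in word)

print(sr_latin("ћирилица"))  # → ćirilica
```

Because the mapping is lossless in this direction, folding both scripts into one frequency table avoids splitting counts for the same word, which is presumably why wordfreq does it.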

thesnide 2 days ago

I think that text on the internet will be tainted by AI the same way that steel has been tainted by nuclear devices.

andai 2 days ago

Has anyone taken a look at a random sample of web data? It's mostly crap. I was thinking of making my own search engine, knowledge database etc based on a random sample of web pages, but I found that almost all of them were drivel. Flame wars, asinine blog posts, and most of all, advertising. Forget spam, most of the legit pages are trying to sell something too!

The conclusion I arrived at was that making my own crawler actually is feasible (and given my goals, necessary!) because I'm only interested in a very, very small fraction of what's out there.

  • andai a day ago

    The unspoken question here, of course, is "you wouldn't happen to have already done this for me?" ;)

joshdavham 2 days ago

If the language you’re processing was generated by AI, it’s no longer NLP, it’s ALP.

honksillet 2 days ago

Twitter was a botnet long before LLMs and Musk got involved.

aftbit 2 days ago

Wow there is so much vitriol both in this post and in the comments here. I understand that there are many ethical and practical problems with generative AI, but when did we stop being hopeful and start seeing the darkest side of everything? Is it just that the average HN reader is now past the age where a new technological development is an exciting opportunity and on to the age where it is a threat? Remember, the Luddites were not opposed to looms, they just wanted to own them.

  • aryonoco 2 days ago

    When?

    For some of us, it was 1994, the eternal September.

    For some of us, it was when Aaron Swartz left us.

    For some of us, it was when Google killed Google Reader (in hindsight, the turning point of Google becoming evil).

    For some others, like the author of this post, it's when twitter and reddit closed their previously open APIs.

  • JohnFen 2 days ago

    > when did we stop being hopeful and start seeing the darkest side of everything?

    I think a decade or two ago, when most of the new tech being introduced (at least by our industry) started being unmistakably abusive and dehumanizing. When the recent past shows a strong trend, it's not unreasonable to expect that the near future will continue that trend. Particularly when it makes companies money.

  • slashdave 2 days ago

    Give us examples of generative AI in challenging applications (biology, medicine, physical sciences), and you'll get a lot of optimism. The text LLM stuff is the brute force application of the same class of statistical modeling. It's commercial, and boring.

anovikov 2 days ago

Sad. I'd love to see by how much the use of the word "delve" has increased since 2021...
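
The measurement itself is simple if you have token counts from a pre-2021 and a post-2021 snapshot; the hard part is getting trustworthy snapshots. A toy sketch (both corpora below are made-up stand-ins, not real data):

```python
from collections import Counter

def relative_frequency(word, tokens):
    """Occurrences of `word` per token in the corpus."""
    counts = Counter(tokens)
    return counts[word] / len(tokens)

# Toy stand-in corpora; a real measurement would need wordfreq-scale data.
pre_2021 = "scholars delve into archives while we explore the data".split()
post_2021 = "we delve into the data and delve into the results we delve".split()

before = relative_frequency("delve", pre_2021)
after = relative_frequency("delve", post_2021)
print(f"fold change: {after / before:.2f}x")
```

With real corpora, the fold change in per-token frequency is the number Philip Shapira's "delve" analysis reports.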

  • Terretta 2 days ago

    > I'd love to see by how much the use of the word "delve" has increased since 2021...

    There are charts / graphs in the link, both since 2021, and since earlier.

    The final graph suggests the phenomenon started earlier, possibly correlated in some way to Malaysian / Indian usages of English.

    It does seem OpenAI's family of GPTs as implemented in ChatGPT unspool concepts in a blend of India-based-consultancy English with American freshmen essay structure, frosted with superficially approachable or upbeat blogger prose ingratiatingly selling you something.

    Anthropic has clearly made efforts to steer this differently, Mistral and Meta as well but to a lesser degree.

    I've wondered if this reflects training material (the SEO is ruining the Internet theory), or is more simply explained by selection of pools of Hs hired for RLHF.

  • chipdart 2 days ago

    From the submission you're commenting on:

    > As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.

  • slashdave 2 days ago

    Amusing that we now have a feedback loop. Let's see... delve delve delve delve delve delve delve delve. There, I've done my part.

  • dqv 2 days ago

    Same for me but with the word “crucial”.

  • xpl 2 days ago

    The fun thing is that while GPTs initially learned from humans (because ~100% of the content was human-generated), future humans will learn from GPTs, because almost all available content will soon be GPT-generated.

    This will surely affect how we speak. It's possible that human language evolution could come to a halt, stuck in time as AI datasets stop being updated.

    In the worst case, we will see a global "model collapse" with human languages devolving along with AI's, if future AIs are trained on their own outputs...

jijojohnxx a day ago

Looks like the wordfreq party is over. Time for the next wave of knowledge tools, wonder what socialsignalai could bring to the table.

eadmund 2 days ago

> the Web at large is full of slop generated by large language models, written by no one to communicate nothing

That’s neither fair nor accurate. That slop is ultimately generated by the humans who run those models; they are attempting (perhaps poorly) to communicate something.

> two companies that I already despise

Life’s too short to go through it hating others.

> it's very likely because they are creating a plagiarism machine that will claim your words as its own

That begs the question. Plagiarism has a particular definition. It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text: i.e., duplicating exact phrases or failing to credit ideas may in some circumstances be plagiarism, but no-one is required to append a statement crediting every text he has ever read to every document he ever writes.

Credits: every document I have ever read. *grin*

  • miningape 2 days ago

    This is just the "guns don't shoot people, people do." argument except in this case we quite literally have a massive upside incentive to remove people from the process entirely (i.e. websites that automatically generate new content every day) - so I don't buy it.

    This kind of AI slop is quite literally written by no one (an algorithm pushed it out), and it doesn't communicate anything, since communication first requires some level of understanding of the source material, and LLMs are just predicting the likely next token without understanding. I would also extend this to AI slop written by someone with a limited domain understanding; they themselves have nothing new to offer, nor the expertise or experience to ensure the AI is producing valuable content.

    I would go even further and say it's "read by no one" - people are sick and tired of reading the next AI slop article on google and add stuff like "reddit" to the end of their queries to limit the amount of garbage they get.

    Sure there are people using LLMs to enhance their research, but a vast, vast majority are using it to create slop that hits a word limit.

  • slashdave 2 days ago

    > It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text

    Given that LLMs and human creativity work on fundamentally different principles, there is every reason to believe there is a difference.

  • weevil 2 days ago

    I feel like you're giving certain entities too much credit there. Yes text is generated to do _something_, but it may not be to communicate in good-faith; it could be keyword-dense gibberish designed to attract unsuspecting search engine users for click revenue, or generate political misinformation disseminated to a network of independent-looking "news" websites, or pump certain areas with so much noise and nonsense information that those spaces cannot sustain any kind of meaningful human conversation.

    The issue with generative 'AI' isn't that they generate text, it's that they can (and are) used to generate high-volume low-cost nonsense at a scale no human could ever achieve without them.

    > Life’s too short to go through it hating others

    Only when they don't deserve it. I have my doubts about Google, but I've no love for OpenAI.

    > Plagiarism has a particular definition ... no-one is required to append a statement crediting every text he has ever read

    Of course they aren't, because we rightly treat humans learning to communicate differently from training computer code to predict words in a sentence and pass it off as natural language with intent behind it. Musicians usually pay royalties to those whose songs they sample, but authors don't pay royalties to other authors whose work inspired them to construct their own stories maybe using similar concepts. There's a line there somewhere; falsely equating plagiarism and inspiration (or natural language learning in humans) misses the point.

whimsicalism 2 days ago

NLP and especially 'computational linguistics' in academia has been captured by certain political interests, this is reflective of that.

will-burner 2 days ago

> It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise.

The dependency on closed data combined with the cost of compute to do anything interesting with LLMs has made individual contributions to NLP research extremely difficult if one is not associated with a very large tech company. It's super unfortunate, makes the subject area much less approachable, and makes the people doing research in the subject area much more homogeneous.

antirez 2 days ago

Ok so post author is an AI skeptic and this is his retaliation, likely because his work is affected. I believe governments should address the problem with welfare, but being against technical advances is always being on the wrong side of history.

  • exo-pla-net 2 days ago

    This is a tech site, where >50% of us are programmers who have achieved greater productivity thanks to LLM advances.

    And yet we're filled to the gills with Luddite sentiments and AI content fearmongering.

    Imagine the hysteria and the skull-vibrating noise of the non-HN rabble when they come to understand where all of this is going. They're going to do their darndest to stop us from achieving post-economy.

    • devjab a day ago

      I think programmers are in the perfect profession to call LLMs out for just how bad they are. They are fancy auto-complete and I love them in my daily usage, but a big part of that is because I can tell when they are ridiculously wrong. Which is so often you really have to question how useful they would be for anything where they aren’t just fancy auto-complete.

      Which isn’t AIs fault. I’m sure they can be great in cancer detection, unless they replace what we’re already doing because they are cheaper than doctors. In combination with an expert AI is great, but that’s not what’s happening is it?

    • antirez 2 days ago

      I fail to see the difference. Actually, programming was one of the first fields where LLMs showed proficiency. The helper nature of LLMs is true in all fields so far; in the future this may change. I believe that, for instance, in the case of journalism the issue was already there: three euros per post written without a clue by humans.

      Anyway, in the long run AI will kill tons of jobs, regardless of blog posts like that. The true key is government assistance.

      • exo-pla-net 2 days ago

        I don't know what difference you are referring to. I was agreeing with you.

        And also agreed: many trumpet the merits of "unassisted" human output. However, they're suffering from ancestor veneration: human writing has always been a vast mine of worthless rock (slop) with a few gems of high-IQ analysis hidden here and there.

        For instance, upon the invention of the printing press, it was immediately and predominantly used for promulgating religious tracts.

        And even when you got to Newton, who created for us some valuable gems, much of his output was nevertheless deranged and worthless. [1]

        It follows that, whether we're a human or an LLM, if we achieve factual grounding and the capacity to reason, we achieve it despite the bulk of the information we ingest. Filtering out sludge is part of the required skillset for intellectual growth, and LLM slop qualitatively changes nothing.

        [1] https://www.newtonproject.ox.ac.uk/view/texts/diplomatic/THE...

floppiplopp 2 days ago

I really like the fact that the content of the conventional user content internet is becoming willfully polluted and ever more useless by the incessant influx of "ai"-garbage. At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information while the idiot rabble has to use new and expensive tools peddled by scammy tech bros to handle the stench of automated manure that flows out of stagnant llms digesting themselves.

  • JohnFen 2 days ago

    > At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information

    It's already happening. There is a growing number of groups forming their own "private internets" that are separated from the internet-at-large, precisely because the internet at large is becoming increasingly useless for a whole lot of valuable things.

  • biofox 2 days ago

    Most of the time, HN is that quiet corner. I just hope it stays that way.

shortrounddev2 2 days ago

Man the AI folks really wrecked everything. Reminds me of when those scooter companies started just dumping their scooters everywhere without asking anybody if they wanted this.

  • analog31 2 days ago

    Perhaps germane to this thread: I think the scooter thing was an investment bubble. It was easier to burn investment money on new scooters than to collect and maintain old ones, until the money ran out.

  • kdmccormick 2 days ago

    At least scooters did something useful for the environment.

    • Sander_Marechal 2 days ago

      Did they? A lot of them were barely used, got damaged or vandalized, etc. And when the companies folded or communities outlawed the scooters, they ended up as trash. I don't believe for a second that the amount of pollutants and greenhouse gases saved by usage is larger than the amount produced by manufacturing, shipping and trashing all those scooters.

    • DrillShopper 2 days ago

      Their batteries on the other hand…

      • kdmccormick 2 days ago

        Sure, they're worse than walking or biking, but compared to an electric car battery or an ICE car?

        • Sharlin 2 days ago

          At least where I'm from, scooters have mostly replaced walking and biking, not car trips :(

yard2010 a day ago

> Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X

God I hate this dystopic timeline we live in.

syngrog66 2 days ago

A few years ago I began an effort to write a new tech book. I originally planned to do as much of it as I could across a series of commits in a public GitHub repo of mine.

I then changed course. Why? I had read increasing reports of human e-book pirates (copying your book's content, then repackaging it for sale under a different title, byline, cover, and possibly at a much lower or even much higher price).

And then the rise of LLMs and their ravenous training ingest bots -- plagiarism at scale and potentially even easier to disguise.

"Not gonna happen." - Bush Sr., via Dana Carvey

Now I keep the bulk of my book material non-public during development. I'm sure I'll share a chapter candidate or so at some point before final release, for feedback and publicity. But the bulk will debut all together at once, and only once polished and behind a paywall.

cdrini a day ago

This has to be the most annoying hacker news comment section I've ever seen. It's just the same ~4 viewpoints rehashed again, and again, and again. Why don't folks just upvote other comments that say the same thing instead of repeating the same things?

And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring.

Also: as to the claims that AI will cause stagnation and a reduction of the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AIs tend to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but more in-depth explanations of what words/sections mean.

Here's to a comment that will never be read because of all the noise in this thread :/

  • appendix-rock a day ago

    They want to display how they’re truly intelligent (unlike LLMs) by *checks notes* rehashing opinions that they’ve read millions of times online.

    Sound familiar to anyone?

    • wbillingsley a day ago

      I wonder whether future generations will be ingrained with a Truman Show fear that maybe only the few thousand people they meet are real and everything else is generated background noise.

      • Cthulhu_ a day ago

        I already get this when I look at e.g. youtube comments.

  • actionfromafar a day ago

    I read it, but I can't say I like it. :-D People will ELI5 everything to understand it, no hard words necessary, up-goer-five style, then "de-compress" it into floral (Amorphophallus titanum scented) GPT speak when sending responses back out.

  • vlan121 a day ago

    You haven't read the whole thing. It says that: or that could benefit generative AI.

    • cdrini a day ago

      I did read it :) not sure how that line applies here, can you expand?

  • advael a day ago

    On a meta level I agree that having this kind of dataset with "before and after" would be pretty interesting. On an object level I do not predict that this would increase the overall diversity of language usage - and in fact it would be extremely surprising if this was even possible due to some general mathematical properties of neural networks - nor would "more professional writing," though I do agree with this characterization of the way AI-generated text sounds. The more I work with LLMs and encounter them in the wild, the greater my confidence that I can tell when something was generated, with the exception of B2B marketing copy and communications from HR departments or state agencies

    On the level of meta-discourse you seem to want to also speak to: Dang even when people have the Official Corporate Approved Perspective (in particular, the claim that it's "like being able to chat with someone that has an infinite vocabulary" is probably the silliest delusional AI hype I've heard all week) and the most upvotes in the thread they still think they're an embattled ideological minority. Starting to think that literally zero people in the modern world don't have or affect a victim complex of some kind

    • cdrini a day ago

      Haha I'm pleasantly surprised to see my comment at the top, I genuinely thought it would drown to the bottom! Not due to disagreement, just due to sheer volume and being posted rather late in this post's lifespan. Anyways my meta comment wasn't that I disagreed with all the other comments, I was just frustrated at how repetitive they were of one another. When I go to leave a comment, I do a pass reading through all or most of the comments to make sure someone hasn't left a comment in the same vein, and it was just frustrating to go through people saying almost verbatim the same thing others were saying! If your comment isn't adding something new, why leave it? I'm all for healthy disagreement :) Also not sure what part of my post sounds like it's from an "embattled ideological minority".

      But speaking of healthy disagreement, as to "chatting with someone that has an infinite vocabulary", I'd love to hear any counterarguments you might have; or was calling it "silly and delusional" meant to be your argument? :P I think it's a pretty uncontroversial statement seeing as eg ChatGPT very likely knows every word in the English language.

      • advael a day ago

        The most ridiculous aspects for me were the anthropomorphizing (Reminds me of that one Sam Altman interview a bit) and the use of "infinite", which both doesn't really work on vibes (as many have noted, while I'm sure chatGPT has been exposed to every word, its pattern of communication is very "regression to the mean" among them), but also is silly if taken literally, because unless we're counting like some quirky technically-grammatical combinatoric compounding that we in practice infer the meaning of from composition of what we identify as separate individual words (like just hyphenating a bunch of adjectives and a noun or something) there's not really an argument for there being "infinite vocabulary" in the same sense that there is for "infinite possible sentences" because being a valid word requires at least that someone can meaningfully comprehend what is meant by it, and coordination requirements of this nature tend to truncate infinities

        The case for ChatGPT doing significant coinage that sticks isn't particularly strong either, partially from theory and partially because I'd think I'd've heard a lot of complaints about it by now, and the ones on hackernews would be repetitive to the point of seeming unavoidable (we agree on that for sure)

        Anyway, re: the silliest hype I've heard all week, I'm mostly just trying to find humor in what has been a pretty bad hype wave for someone who's pathologically bad at sounding like the kind of nontechnical hype guys that pervade any tech hype wave but is nonetheless mostly seeking out jobs in this field because it's what my expertise is in. Incredibly awful job market for a lot of people I realize, but it feels like a special hell I get for getting into ML research before it was (quite so) cool. I'm trying to fight the negativity but I've gotten screwed over a lot lately, but I don't have anything against you personally for being silly on hn

        • cdrini an hour ago

          Ah ok so anthropomorphizing and the phrase "infinite vocabulary" sounds impossible. I agree infinite vocabulary is a bit murky, and mathematically incorrect. If I wanted to be more mathematically correct I could say complete vocabulary, but I think that's actually a little less understandable to people. I did not mean infinite vocabulary in that it coins new words, just infinite as in very large to the point of being incomprehensibly large by a single individual. As per anthropomorphizing, I think the word "chat" is the most anthropomorphizing I did, so don't agree with you on that one.

          Ah mate sorry to hear that, the market is tough right now. I will say objectively I believe there's very little in my comment that's hype-y. I think using AI while reading documents out of your comfort zone, and asking it questions can expand your vocabulary. I've personally tried it, it's helped me read papers not in my field, it's helped me find papers for better research. I can understand how someone can disagree with that, but calling it hype sounds to me more like a response to an invisible enemy/to "all the ones who hyped before" than to an actual concrete response to this specific case. And I think that mentality could put you in a potential catch-22 mental loop that will leave you constantly dissatisfied with anything AI or ML, by constantly seeing this invisible enemy where it might not be present. Anyways, stay positive and best of luck with the job hunt!

          Edit: and it looks like my comment has now fallen deep into the depths of the comment thread, never to be heard from again! See, I told you I was an embattled ideological minority ;)

      • mark-r a day ago

        Sure, ChatGPT knows every word in the English language (and probably quite a few that ain't). But how likely is it to use them all?

        • cdrini 41 minutes ago

          Now that's an argument! Agreed, it won't use them of its own accord, but the fact that you can ask it about words, or ask it even to break down important words in a new field, or give it a paragraph from a paper not in your field and have it explain the jargon, I think that's how it can help someone grow their vocabulary.

  • BiteCode_dev a day ago

    Also it breaks the language barrier: you can now read the Chinese internet if you want, or chat transparently in Arabic. That's going to be interesting.

    • Cthulhu_ a day ago

      At the moment though (and ever since decent online translation services were a thing), it feels one-way, that is, people from that side of the internet coming to the anglosphere internet moreso than anglosphere people going internet-abroad. I may be wrong.

      • BiteCode_dev a day ago

        As a Frenchman, I learned very quickly that my language sphere market and resource pool is so much smaller than the English one that it's 10 times less effective to do anything in it.

        I understand the position.

        The only exception would be China, but the GFW is probably not helping.

        LLMs might lower the cost of that so much that it will become more interesting to do so.

  • [removed] a day ago
    [deleted]
adr1an a day ago

I guess curating unpolluted text is one of the new jobs GenAI created? /s

[removed] a day ago
[deleted]
next_xibalba 2 days ago

[flagged]

  • Ensorceled 2 days ago

    > Also, it is shocking how authoritarian the “left” has become in my lifetime.

    We are going through a general uptick in authoritarian "discussions" online. It's interesting that you are only seeing it on the "left".

    • Lerc 2 days ago

      Perhaps they only find the increase shocking on the left.

      There's a bunch of ways to measure political opinions. The Authoritarian-liberal one being one of many. The economic-left and the economic-right are becoming more separated from the social-left and the social-right.

      Tribalism also causes people to take on the positions of their 'tribe' which may be distinct to what their own personalities might normally gravitate to.

      In the past, it has been the economic-left and social-right that were more prone to authoritarianism with their proponents believing that their ideals should be enforced.

      The economic-right and social-left was more of a logic vs empathy tension ('this works' vs 'this is right') and a lot of people seemed to reconcile the two for one flavour of centrism.

      To me it is a little shocking how authoritarian elements of the social-left have become, an ideology that has long been characterized by empathy and supporting others seems to have become blended with opinions which are exclusionary or dogmatic, which seem counter to their own principles.

      In some respects maybe this is just the march of time making the progressive opinions of one generation the orthodoxy of the next and these people are just finding a new conservatism rooted in a new orthodoxy.

      • Eisenstein 2 days ago

        Isn't that just what happens when you keep pushing the Overton Window to the right? What would have been 'centrists' have to become more authoritarian to stand ground or else they let their position get absorbed by the stronger leaning side. When one side refuses to compromise even slightly, you have two options: give in or dig in.

    • Diti 2 days ago

      As far as I noticed, the “right” effectively gets the boot in most online communities which abide to a Code of Conduct, leaving mostly the “left” (the most recent example I have in mind of such moderation efforts is the save-nix-together.org open letter). It’s interesting that you don’t notice this happening in the communities you seem to frequent.

      • Ensorceled 2 days ago

        No, I see a lot of the "right" discourse. Many are openly supporting Putin now. I follow many conservative (US) pundits and journalists and they have either taken a hard right turn or are raising the alarm and supporting Harris. I see similar trends here in Canada.

        Yes, I see that the left has become more authoritarian, but it pales to the hard shift I see on the right.

    • acheong08 2 days ago

      They said nothing about it being “only” on the left.

      I somewhat expect authoritarianism on the right and therefore would hold the left (to which I belong) at a higher standard.

      • Ensorceled 2 days ago

        Authoritarianism on the right is becoming its mainstream, while authoritarianism on the left is merely on the rise.

    • robertlagrant 2 days ago

      > It's interesting that you are only seeing it on the "left".

      One explanation is that now things have switched round, and people with left-wing beliefs, sometimes extremely left-wing beliefs, control a lot of institutions and structures. People who are my age (too old) grew up with the right being in that position, but I don't think that's a contemporary instinct to possess.

  • Miraltar 2 days ago

    It did feel emotive, but that wasn't the main point. Data is harder to get (or more expensive) and more polluted.

    • thomasfromcdnjs 2 days ago

      Felt super emotive to me. The problems the author is outlining a) might not be actual problems and b) just require new thinking to solve.

      • albedoa 2 days ago

        The problems are well-known and highly-documented. You should leave the determination of (b) up to those who know and understand (a), which includes the author.

    • algaeselect 2 days ago

      It does poison the article when someone is talking about subject X and then feels the need to insert their political opinion about person Y, or that they don't like that Z is left-wing/right-wing. It makes them seem not at all objective, and calls into doubt what else they are not being objective about in their article.

  • Jcampuzano2 2 days ago

    Even if it's true that this was an emotional decision, so much of Twitter/X itself is now AI slop anyway that it'd be worth it to just not include it, whether it was right-wing or not.

    Regardless, the owner is well within his right to make an emotional decision based on his beliefs to stop anyway.

  • weweweoo 2 days ago

    Yeah, I'm no fan of Musk or Trump, but I think Twitter always was a spam-infested, hateful cesspool where people with online-addiction yelled at each other. There was nothing for Musk to ruin, because the whole concept was rotten from the start. Allowing only short messages doesn't promote intelligent discussion, it does the opposite.

  • hhh 2 days ago

    I don't care about the political part, but Twitter used to have nice, high-enough-quality expert discussion around topics, and now it's just a shithole with very stupid takes polluting the few spots left free in the replies between engagement farmers and LLM slop-spewers.

jaimex2 2 days ago

[flagged]

  • x3ro 2 days ago

    Do you mind explicitly saying what views and what "mainstream media" you are referencing here?

  • [removed] 2 days ago
    [deleted]
  • hluska 2 days ago

    I don’t know how, but you manage to consistently flood this site with garbage content. And while there is a lot of it, your content is poor enough that I remember you.

    You’re not edgy dude. Give it up.

hoseja 2 days ago

>"Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X. Even if X made its raw data feed available (which it doesn't), there would be no valuable information to be found there.

>Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.

>And given what's happening to the field, I don't blame them."

What beautiful doublethink.

  • mschuster91 2 days ago

    > What beautiful doublethink.

    Given just how many AI bots scrape up everything they can, oftentimes ignoring robots.txt or any rate limits (there have been a few complaint threads on HN about that), I can hardly blame the operators of large online services just cutting off data feeds.

    Twitter however didn't stop their data feeds due to AI or because they wanted money, they stopped providing them because its new owner does everything he can to hinder researchers specializing in propaganda campaigns or public scrutiny.

    • hluska 2 days ago

      What was Reddit’s excuse? They did roughly the same thing (and have just as much garbage content).

      In other words, why is it wrong for X but okay for Reddit? If you ignore one individual’s politics, the two services did the same thing.

      • mschuster91 2 days ago

        Reddit shut their API access down only very recently, after the AI craze went off. Twitter did so right after Musk took over, way before Reddit, way before AI ever went nuts.

        • dotnet00 2 days ago

          X shut down API access in Feb 2023, Reddit shut theirs down at the end of June of the same year. Just barely 6 months apart.

          Furthermore, while X had also only announced this in February, Reddit announced their API shutdown just 2 months later in April.

          And, to further add to that, X was pretty upfront that they think they have access to a large and powerful dataset in X and didn't want to give it out for free. Reddit used very similar wording when announcing their changes.

QRe 2 days ago

I understand the frustration shared in this post but I wholeheartedly disagree with the overall sentiment that comes with it.

The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill anything.

The world is chaotic and net entropy (degree of disorder) of any isolated or closed system will always increase. Same goes for the web. We just have to embrace it and overcome the challenges that come with it.

  • ryukoposting 2 days ago

    I'm not so optimistic. The most basic requirements are:

    1. Prove the human-ness of an author...

    2. ...without grossly encroaching on their privacy.

    3. Ensure that the author isn't passing off AI-generated material as their own.

    We'll leave out the "don't let AI models train on my data" part for now.

    Whatever solution we come up with, if any, will necessarily be mired in the politics of privacy, anonymity, and/or DRM. In any case, it's hard to conceive of a world where the human web returns as we once knew it.

    • vundercind a day ago

      The good news—such as it is—is that the Web never really became what we assumed it surely would in its early days.

      If it was never really the case that, for serious or self-improving reading, you'd have been better off with only the Web than with only access to a decent library, then we haven't lost something so precious.

      I mean, the most valuable site on the Web is probably a book & research paper piracy website. That’s its crowning achievement. Faster interlibrary loan, basically, but illegal.

  • brunokim 2 days ago

    Here is an expert saying there is a problem and explaining how it killed their research effort, and yet you say that things are the same as ever and nothing was killed.

    • QRe a day ago

      1. I am not discrediting the expert in any way. If anything, I think their decision to quit is understandable: a challenge arose during their research that is not in their interest to pursue (information pollution is not research in corpus linguistics / NLP).

      2. I never said that things are the same as ever; quite the opposite, actually. I am saying the world evolves constantly. It's naive to say company X/Y/Z killed something or made something unusable when there is constant, inevitable change. We should focus on how to move forward given this constraint, and not dwell on times when the web was so much 'cleaner', 'nicer', more manageable, etc.