oneeyedpigeon 2 days ago

I wonder if anyone will fork the project. Apart from anything else, the data may still be useful given that we know it is polluted. In fact, it could act as a means of judging the impact of LLMs via that very pollution.

  • Miraltar 2 days ago

    I guess it would be interesting, but differentiating pollution from language evolution seems very tricky, since getting a non-polluted corpus gets harder and harder

    • Retr0id 2 days ago

      Arguably it is a form of language evolution. I bet humans have started using "delve" more too, on average. I think the best we can do is look at the trends and think about potential causes.

      • rvnx 2 days ago

        “Seamless”, “honed”, “unparalleled”, “delve” are now polluting the landscape because of monkeys repeating what ChatGPT says without even questioning what the words mean.

        Everything is “seamless” nowadays. Like I am seamlessly commenting here.

        Arguably, the meaning of these words evolves due to misuse too.

      • pavel_lishin 2 days ago

        > I bet humans have started using "delve" more too, on average.

        I wish there were a way to check.

        • linhns 7 hours ago

          I'm seeing more and more uses of it on this thread.

    • wpietri 2 days ago

      One way to tackle it would be to use LLMs to generate synthetic corpuses, so you have some good fingerprints for pollution. But even there I'm not sure how doable that is given the speed at which LLMs are being updated. Even if I know a particular page was created in, say, January 2023, I may no longer be able to try to generate something similar now to see how suspect it is, because the precise setups of the moment may no longer be available.

greentxt 2 days ago

I think this person has too high a view of pre-2021, probably for ego reasons. In fact, their attitude seems very ego-driven. AI didn't just occur in 2021. Nobody knows how much text was machine-generated prior to 2021; it was much harder, if not impossible, to detect. If anything, it's probably easier now, since people are all using the same AIs that use words like "delve" so much that it becomes obvious.

  • croes 2 days ago

    >AI didn't just occur in 2021. Nobody knows how much text was machine generated prior to 2021

    But we do know that now it's a lot more, with a capital LOT.

    • greentxt 2 days ago

      I assume you are correct but how can we know rather than assume? I am not sure we can, so why get worked up about "internet died in 2021" when many would claim with similar conviction that it's been dead since 2012, or 2007, or ...

      • ClassyJacket 2 days ago

        You are making a claim that somehow someone was sitting on something as powerful as ChatGPT, long before ChatGPT, and that it was in widespread use, secretly, without even a single leak by anyone at any point. That's not plausible.

        • nlpparty a day ago

          Twitter has been accused of being full of bots long before ChatGPT appeared. For 140 characters, a template with synonyms would be enough to create mass-generated content.

miguno 2 days ago

I have been noticing this trend increasingly myself. It's getting more and more difficult to use tools like Google search to find relevant content.

Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if internet humanity is already in retreat to its last homely houses, which are more closed than open to the outside.

On the positive side:

1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.

2. As a result of (1), I have started to use an RSS reader again. Who would have thought!

I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.

jgord a day ago

We will soon face another kind of bit-rot : where so much text is generated by LLMs that it pollutes the human natural language corpus available for training, on the web.

Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.

But I hazard a guess this won't happen, as it's a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.

  • ipaddr a day ago

    Old content can make money now and will be more valuable, so why wouldn't it happen more frequently?

jchook 2 days ago

If it is (apparently) easy for humans to tell when content is AI-generated slop, then it should be possible to develop an AI to distinguish human-created content.

As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.
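A crude version of the perplexity heuristic mentioned above can be sketched with a smoothed unigram model. This is a toy illustration, not a production detector; the tiny corpora here are invented for the example:

```python
import math
from collections import Counter

def unigram_model(corpus_words, smoothing=1.0):
    """Build a Laplace-smoothed unigram probability model from a word list."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    vocab = len(counts)
    def prob(word):
        # Smoothing gives unseen words a small nonzero probability
        return (counts[word] + smoothing) / (total + smoothing * (vocab + 1))
    return prob

def perplexity(words, prob):
    """Exponentiated average negative log-probability: lower = more 'expected'."""
    log_sum = sum(-math.log(prob(w)) for w in words)
    return math.exp(log_sum / len(words))

reference = "the cat sat on the mat and the dog sat on the rug".split()
prob = unigram_model(reference)

typical = "the cat sat on the rug".split()
unusual = "quantum delve seamlessly honed".split()

# Text resembling the reference corpus scores lower perplexity
assert perplexity(typical, prob) < perplexity(unusual, prob)
```

Real perplexity-based detectors score text under a full language model rather than unigrams, but the principle is the same: unexpectedly "smooth" or unexpectedly alien distributions both stand out.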

  • aDyslecticCrow 2 days ago

    > If it is (apparently) easy for humans to tell when content is AI-generated slop

    It's actually not. It's rather difficult for humans as well. We can see verbose text that is confused and call it AI, but it could just be a human as well.

    To borrow an older model training method, "Generative adversarial network". If we can distinguish AI from humans... We can use it to improve AI and close the gap.

    So, it becomes an arms race that constantly evolves.

sashank_1509 2 days ago

Not to be too dismissive, but is there a worthwhile direction of research to pursue in NLP that is not LLMs?

If we add linguistics to NLP I can see an argument, but if we define NLP as the research of enabling a computer to process language, then it seems to me that LLMs/generative AI is the only research an NLP practitioner should focus on, and everything else is moot. Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep learning model on a lot of data?

  • sinkasapa 2 days ago

    Maybe it is "including linguistics", but most of the world's languages don't have the data available to train on. So I think one major question for NLP is exactly the question you posed: "Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep learning model on a lot of data?"

aucisson_masque 2 days ago

It could be used to spot LLM generated text.

Compare the frequency of words to those used in natural human writing and you spot the computer from the human.
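A toy sketch of that idea, with invented baseline rates standing in for real per-million frequencies from a pre-LLM corpus (the marker words and numbers here are assumptions for illustration only):

```python
from collections import Counter

# Hypothetical per-million baseline rates for a few "LLM marker" words;
# real rates would come from a pre-2021 corpus such as wordfreq's data.
HUMAN_RATE_PER_MILLION = {"delve": 2.0, "seamless": 5.0, "unparalleled": 3.0}

def marker_score(text):
    """Sum of observed/expected ratios for marker words; higher = more LLM-like."""
    words = text.lower().split()
    counts = Counter(words)
    per_million = 1_000_000 / max(len(words), 1)
    score = 0.0
    for word, baseline in HUMAN_RATE_PER_MILLION.items():
        observed = counts[word] * per_million
        score += observed / baseline
    return score

plain = "we compared the two systems and found one faster"
floral = "we delve into the seamless and unparalleled results"

assert marker_score(floral) > marker_score(plain)
```

A serious test would use a proper statistic (chi-square or log-likelihood over many words) rather than a handful of markers, and, as replies below note, the baseline itself drifts as humans pick up LLM vocabulary.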

  • Lvl999Noob 2 days ago

    It could be used to differentiate LLM text from pre-LLM human text, maybe. The thing is, our AIs may not be very good at learning, but our brains are. The more we use AI, and the more we integrate LLMs and other tools into our lives, the more their output will influence us. I believe there was a study (or a few anecdotes) where college papers checked for AI material were marked AI-written even though they were written by humans, because the students had used AI during their studying and learned from it.

    • MPSimmons 2 days ago

      You're exactly right. You only have to look at the prevalence of the word "unalive" in real life contexts to find an example.

    • thfuran 2 days ago

      >our AIs may not be very good at learning but our brains are

      Brains aren't nearly as good at slightly adjusting the statistical properties of a text corpus as computers are.

    • left-struck 2 days ago

      > The more we use AI, the more we integrate LLMs and other tools into our life, the more their output will influence us

      Hmm, I don’t disagree, but I think it will be a valuable skill going forward to write text that doesn’t read like it was written by an LLM

      This is an arms race that I’m not sure we can win though. It’s almost like a GAN.

  • ithkuil 2 days ago

    It may work for a short time, but after a while natural language will evolve due to natural exposure to those new words or word patterns, and even humans will write in ways that, while being different from the LLMs, will also be different from the snapshot captured by this dataset. It's already the case that we wrote differently 20 years ago than 50 years ago, and even more so 100 years ago, etc.

  • slashdave 2 days ago

    Hardly. You are talking about a statistical test, which will have rather large errors (since it is based on word frequencies). Not to mention word frequencies will vary depending on the type of text (essay, description, advertisement, etc).

  • TacticalCoder 2 days ago

    > ... compare the frequency of words to those used in human natural writings and you spot the computer from the human.

    But that's a losing endeavor: if you can do that, you can immediately ask your LLM to fix its output so that it passes that test (and many others). It can introduce typos, make small errors on purpose, and anything you can think of to make it look human.

karaterobot 2 days ago

I guess a manageable, still-useful alternative would be to curate a whitelist of sources that don't use AI, and, without making that list public, derive the word frequencies from only those sources. How to compile that list is left as an exercise for the reader. The result would not be as accurate as a broad sample of the web, but in a world where it's impossible to trust a broad sample of the web, it's the option you are left with. And I have no reason to doubt that it could be done at a useful scale.
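The mechanics of that approach are simple once the (hard) curation is done; a minimal sketch, with hypothetical domain names standing in for a real vetted whitelist:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical whitelist of domains believed (not guaranteed) to be human-written
WHITELIST = {"example-blog.org", "small-forum.net"}

def whitelist_frequencies(pages):
    """pages: iterable of (url, text) pairs; count words only from whitelisted domains."""
    counts = Counter()
    for url, text in pages:
        domain = urlparse(url).netloc
        if domain in WHITELIST:
            counts.update(text.lower().split())
    return counts

pages = [
    ("https://example-blog.org/post", "the quick brown fox"),
    ("https://contentfarm.biz/page", "delve delve delve"),
]
freqs = whitelist_frequencies(pages)
assert freqs["delve"] == 0 and freqs["fox"] == 1
```

The counting is trivial; all the difficulty lives in deciding which domains belong in the set, which is exactly the exercise left to the reader.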

I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.

  • WaitWaitWha 2 days ago

    > curate a whitelist of sources that don't use AI,

    I like this.

    Maybe even take it a step further - have a badge on the source that is both human and machine visible to indicate that the content is not AI generated.

PeterStuer 2 days ago

Intuitively I feel like word frequency would be one of the things least impacted by LLM output, no?

  • Jcampuzano2 2 days ago

    It'd in fact be quite the opposite. There comes a turning point where the majority of language usage would actually be written by AI, at which point we'd no longer be analysing the word frequency/usage of actual humans, and so it wouldn't be representative of how humans actually communicate.

    Or potentially even more dystopian would be that AI slop would be dictating/driving human communication going forward.

  • baq 2 days ago

    ‘delve’ is given as an example right there in TFA.

    • PeterStuer 2 days ago

      Yes, but the material presented in no way makes a distinction between potential organic growth of 'delve' vs. LLM-induced use. They just note that even though 'delve' was already on the rise, in 23-24 the word gained more popularity at the same time ChatGPT rose. Word adoption is certainly not a linear phenomenon. And as the author states, 'I don't think anyone has reliable information about post-2021 language usage by humans'

      So I would still state that noun-phrase frequency in LLM output tends to reflect noun-phrase frequency in the training data in a similar context (disregarding, for the moment, enforced bias induced through RLHF and other tuning)

      I'm sure there will be cross-fertilization from LLM to human and back, but I'm not seeing the data yet that the influence on word frequency is that pronounced.

      The author seems to have some other objections to the rise of LLMs, which I fully understand.

      • QuiDortDine 2 days ago

        The fact that making this distinction is impossible is reason enough to stop.

      • beepbooptheory 2 days ago

        Even granting that we can disregard a really huge factor here, which I'm not sure we really can, one cannot know beforehand how the clustering of the vocabulary will go pre-training, and it's speculated that both at the center and at the edges of clusters we get random particularities. Hence the "solidgoldmagikarp" phenomenon and many others.

    • whimsicalism 2 days ago

      there is almost certainly organic growth as well, as more people in Nigeria and other SSA countries have gained very good internet penetration in recent years

  • joshdavham 2 days ago

    Think of an LLM as a person on the internet. Just like everyone else, they have their own vocabulary and preferred way of talking which means they’ll use some words more than others. Now imagine we duplicate this hypothetical person an incredible amount of times and have their clones chatter on the internet frequently. ‘Certainly’ this would have an effect.

    • efskap a day ago

      Yes but this person learned to mimic the internet at large. Theoretically its preferred way of talking would be the average of all training data, as mimicry is GPT's training objective, and would therefore have very similar word distributions. Only, this doesn't account for RLHF and prompts spreading memetically among users.

      • joshdavham 21 hours ago

        > Theoretically its preferred way of talking would be the average of all training data

        This is incorrect. Furthermore, what the LLM says is also determined by what its user wants it to say, and how frequently the user wants the LLM to post on the internet. This will have a large effect on the internet’s word frequency distribution.

  • cdrini a day ago

    If only we had a data set that measured word frequency across the internet as we're getting more and more into AI being used... Maybe with a baseline from before 2021 for comparison... But no let's just stop measuring word frequency entirely because we can just assume what will happen and we're angry.

charlieyu1 2 days ago

The web before 2021 was still polluted by content farms. The articles were written by humans, but still, they were rubbish. Not comparable to the current rate of generation, but the web was already dominated by them.

  • devjab a day ago

    Maybe, but if you're studying the way humans use language you're still getting human-made data from rubbish. There isn't any value in AI-generated content if what you're cataloging is human language.

altcognito 2 days ago

It might be fun to collect the same data, if for no other reason than to note the changes, but adding the caveat that it doesn't represent human output.

Might even change the tool name.

  • jpjoi 2 days ago

    The point was it’s getting harder and harder to do that as things get locked down or go behind a massive paywall, either to profit off of or to avoid being used in generative AI. The places where previous versions got data are impossible to gather from anymore, so the dataset you would collect would be completely different, which (might) cause weird skewing.

    • oneeyedpigeon 2 days ago

      But that would always be the case. Twitter will not last forever; heck, it may not even be long before an open alternative like Bluesky competes with it. Would be interesting to know what percentage of the original mined data was from Twitter.

jadayesnaamsi 2 days ago

The year 2021 is to wordfreq what 1945 was to carbon-14 dating.

I guess the same way the scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non human content.

Saying magic because, unfortunately, it was much easier to detect nuclear testing in the atmosphere than it will be to detect AI-generated content.

ilaksh 2 days ago

Reading through this entire thread, I suspect that somehow generative AI actually became a political issue. Polarized politics is like a vortex sucking all kinds of unrelated things in.

In case that doesn't get my comment completely buried, I will go ahead and say honestly that even though "AI slop" and paywalled content are a problem, I don't think that generative AI in itself is a negative at all. And I also think that part of this person's reaction is that LLMs have made previous NLP techniques, such as those based on simple usage counts etc., largely irrelevant.

What was/is wordfreq used for, and can those tasks not actually be done more effectively with a cutting edge language model of some sort these days? Maybe even a really small one for some things.

  • ecshafer 2 days ago

    Generative AI is inherently a political issue; it's not surprising at all.

    There is the case of what is "truth". As soon as you start to ensure some quality of truth to what is generated, that is political.

    As soon as generative AI has the capability to take someone's job, that is political.

    The instant AI can make someone money, it is political.

    When AI is trained on something that someone has created, and now they can generate something similar, it is political.

    • whimsicalism 2 days ago

      > As soon as generative AI has the capability to take someone's job, that is political.

      What is political is people enshrining themselves in chokepoints and demanding a toll for passing through or getting anything done. That is what you do when you make a certain job politically 'untakable'.

      People who espouse that the 'personal is political' risk making the definition of politics so broad that it is useless.

    • ilaksh 2 days ago

      Then .. everything is political?

      • commodoreboxer 2 days ago

        Everything involving any kind of coordination, cooperation, competition, and/or communication between two or more people involves politics by its very nature. LLMs are communication tools. You can't divorce politics from their use when one person is generating text for another person to read.

      • JohnFen 2 days ago

        "Just because you do not take an interest in politics doesn't mean politics won't take an interest in you." -- Pericles

  • rincebrain 2 days ago

    The simplest example that comes to mind of something frequency analysis might be useful for would be if you had simple ciphertext where you knew that the characters probably mapped 1:1, but you didn't know anything about how.
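    A toy sketch of that first step: rank ciphertext letters by frequency and map them onto English letters ranked the same way. This is only a crude first guess that real attacks refine with bigram statistics and trial decryption:

```python
from collections import Counter

# English letters ordered by typical frequency (a common approximation)
ENGLISH_BY_FREQ = "etaoinshrdlcumwfgypbvkjxqz"

def guess_substitution(ciphertext):
    """Map each cipher letter to an English letter of matching frequency rank."""
    letters = [c for c in ciphertext.lower() if c.isalpha()]
    ranked = [letter for letter, _ in Counter(letters).most_common()]
    return {cipher: plain for cipher, plain in zip(ranked, ENGLISH_BY_FREQ)}

# The most frequent ciphertext letter gets mapped to 'e'
mapping = guess_substitution("xqxx yx zxy")
assert mapping["x"] == "e"
```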

    It could also be useful for guessing whether someone might have been trying to do some kind of steganographic or additional encoding in their work, by telling you how abnormal it is, compared to how most people write, that someone happened to choose a very unusual construction, or whether it's unlikely that two people chose the same unusual construction by coincidence rather than plagiarism.

    You might also find statistical models interesting for things like noticing patterns in people for whom English or another language is not their first language, and when they choose different constructions more often than speakers for whom it is their first language.

    I'm not saying you can't use an LLM to do some or all of these, but they also have something of a scalar attached to them of how unusual the conclusion is - e.g. "I have never seen this construction of words in 50 million lines of text" versus "Yes, that's natural.", which can be useful for trying to inform how close to the noise floor the answer is, even ignoring the prospect of hallucinations.

  • whimsicalism 2 days ago

    Yes, it's become extremely politicized and it's very tiresome. Tech in general, to be frank. Pray that your field of interest never gets covered in the NYT.

donatj 2 days ago

I hear this complaint often, but in reality I have encountered fairly little content in my day-to-day that has felt fully AI-generated. AI-assisted, sure, but is that a problem if a human is in the mix, curating?

I certainly have not encountered enough straight drivel where I would think it would have a significant effect on overall word statistics.

I suspect there may be some over-identification of AI content happening, a sort of Baader–Meinhof effect cognitive bias. People have their eye out for it and suddenly everything that reads a little weird logically "must be AI generated" and isn't just a bad human writer.

Maybe I am biased: about a decade ago I worked for an SEO company with a team of copywriters who pumped out mountains of the most inane keyword-packed text, designed for literally no one but Google to read. It would rot your brain if you tried to read it, and it was written by hand by a team of human beings. This existed WELL before generative AI.

  • pavel_lishin 2 days ago

    > I hear this complaint often but in reality I have encountered fairly little content in my day to day that has felt fully AI generated?

    How confident are you in this assessment?

    > straight drivel

    We're past the point where what AI generates is "straight drivel"; every minute, it's harder to distinguish AI output from human output unless you're approaching expertise in the subject being written about.

    > a team of copywriters who pumped out mountains the most inane keyword packed text designed for literally no one but Google to read.

    And now a machine can generate the same amount of output in 30 seconds. Scale matters.

    • PhunkyPhil 2 days ago

      > every minute, it's harder to distinguish AI output from actual output unless you're approaching expertise in the subject being written about.

      So, then what really is the problem with just including LLM-generated text in wordfreq?

      If quirky word distributions will remain a "problem", then I'd bet that human distributions for those words will follow shortly after (people are very quick to change their speech based on their environment, it's why language can change so quickly).

      Why not just own the fact that LLMs are going to be affecting our speech?

      • pavel_lishin a day ago

        > So, then what really is the problem with just including LLM-generated text in wordfreq?

        > Why not just own the fact that LLMs are going to be affecting our speech?

        The problem is that we cannot tell what's a result of LLMs affecting our speech, and what's just the output of LLMs.

        If LLMs result in a 10% increase of the word "gimple" online, which then results in a 1% increase of humans using the word "gimple" online, how do we measure that? Simply continuing to use the web to update wordfreq would show a 10% increase, which is incorrect.
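        The confound can be put in one line: the observed rate is a blend of two unknowns (LLM share and human usage), so very different mixtures are indistinguishable from a single corpus measurement. A toy illustration, with invented numbers:

```python
def observed_rate(llm_share, llm_rate, human_rate):
    """Blended per-million rate of a word when llm_share of the corpus is LLM output."""
    return llm_share * llm_rate + (1 - llm_share) * human_rate

# Two very different worlds...
world_a = observed_rate(llm_share=0.10, llm_rate=50.0, human_rate=5.0)   # humans unchanged
world_b = observed_rate(llm_share=0.05, llm_rate=50.0, human_rate=7.37)  # humans shifted

# ...can yield (nearly) the same corpus-level measurement
assert abs(world_a - world_b) < 0.01
```

One equation, two unknowns: without an independent estimate of either the LLM share or the human rate, the corpus measurement alone can't separate them.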

diggan 2 days ago

One of the examples is the increased usage of "delve" which Google Trends confirms increased in usage since 2022 (initial ChatGPT release): https://trends.google.com/trends/explore?date=all&q=delve&hl...

It seems, however, that it started increasing most in usage just these last few months; maybe people are talking more about "delve" specifically because of the increase in usage? A usage recursion of sorts.

  • bee_rider 2 days ago

    We’ve seen this with a couple of words and expressions, and I don’t doubt that AI is somewhat likely to “like” some phrases for whatever reason. Big eigenvalues of the latent space or whatever, hahaha (I don’t know AI).

    But also, words and phrases do become popular among humans, right? It would be a shame if AI caused the language to stagnate, as keeping up with which phrases are popular gets you labeled as an AI.

    • cdrini a day ago

      Exactly, like how "mindful" and "demure" recently became more popular for seemingly no reason. Humans do this all the time.

      And language in general stagnates and shrinks in vocabulary over time (https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s...). (Link that ChatGPT helped me find :P) I think AI will increase the average person's vocabulary, since it appears in general to be better/more professionally written than a lot of what the average person is exposed to online.

  • bongodongobob 2 days ago

    Delves are a new thing in World of Warcraft released 9/10 this year. Delve is also an M365 product that has been around for some time and is being discontinued in December. So no, that has nothing to do with LLMs.

    • _proofs 2 days ago

      Delve was also an addition to PoE, which I imagine had its own spike in google searches relative to that word.

  • nlpparty a day ago

    If you select only the USA, the trend disappears.

hcks 2 days ago

Okay, but how big a sample size do we actually need for word frequencies? What's the goal here? It looks like the initial project isn't even stratified per year/decade.

nlpparty 2 days ago

It's just inevitable. Imagine a world where we get a cheap and accessible AGI. Most work in the world will be done by it. Certainly, it will organise the work the way it finds preferable. Humans (and other AIs) will find it much harder to learn from example, as most of the work is performed in the same uniform way. The AI revolution should start with the field closest to its roots.

nlpparty a day ago

https://trends.google.com/trends/explore?date=all&geo=US&q=d...

The funny fact: it doesn't result in an increase in searches for "delve".

  • 1d22a a day ago

    That chart shows people searching for the word "delve", and isn't (directly) related to the incidence of words in content on the open web.

    • nlpparty a day ago

      I just assumed that if many people, especially less proficient language users, encounter this word in text generated by ChatGPT, they would look it up.

jhack a day ago

Kind of weird to believe “slop” didn’t exist on the internet in mass quantities before AI.

avazhi a day ago

I agree with the general ethos of the piece (albeit a few of the details are puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel, nor does what Reddit is doing make much intellectual as opposed to economic [IPO-influenced] sense) - but this line:

'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'

really does betray some real naivete. OpenAI and Google could literally burn $10 million per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with either by an army of employees manually cleaning up the data, or by algorithms Google has its own LLM write for itself.

I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).

  • devjab a day ago

    > puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel

    Maybe this is because I’m European, but what is partisan about calling X invariably worthless drivel? It seems a lot like fact to me, considering what has been going on with the platform's moderation since Elon Musk bought it. It’s so bad that the EU considers it a platform for misinformation these days.

    • cdrini a day ago

      Do you have a citation on that last claim?

    • avazhi a day ago

      Because the author specifically mentioned that it's worthless because it's 'right-wing' (a 'right-wing cesspool'), as if there aren't plenty of people espousing left-wing views on the platform. The right-wing comment in particular is what makes the statement blatantly partisan.

    • bakugo a day ago

      > It’s so bad that the EU consider it a platform for misinformation these days.

      Can you define "misinformation"? Is it just things the government disagrees with?

jedberg 2 days ago

We need a vintage data/handmade data service. A service that can provide text and images for training that are guaranteed to have either been produced by a human or produced before 2021.

Someone should start scanning all those microfiche archives in local libraries and sell the data.

WalterBright a day ago

I've wondered from time to time why I collect history books and keep my encyclopedias when I could just google things. Now I know why: they predate AI and are unpolluted by generated bilge.

zaik 2 days ago

If generative AI has a significantly different word frequency from humans, then it also shouldn't be hard to detect text written by generative AI. However, my last information is that tools to detect text written by generative AI are not that great.

DebtDeflation 2 days ago

Enshittification is accelerating. A good 70% of my Facebook feed is now obviously AI-generated images with AI-generated text blurbs that have nothing to do with the accompanying images, likely posted by overseas bot farms. I'm also noticing more and more "books" on Amazon that are clearly AI-generated and self-published.

  • janice1999 2 days ago

    It's okay. Amazon has limited authors to self publishing only 3 books per day (yes, really). That will surely solve the problem.

    • wpietri 2 days ago

      Hah! I'm trying to figure out the exact date that crossed from "plausible line from a Stross or Sterling novel" [1] to "of course they did".

      [1] Or maybe Sheckley or Lem, now that I think about it.

    • Drakim 2 days ago

      I read that as 3 books per year at first and thought to myself that that was a rather harsh limitation, but surely any truly respectable author wouldn't be spitting out more than that...

      ...and then I realized you wrote 3 books a day. What the hell.

  • Sohcahtoa82 2 days ago

    > A good 70% of my Facebook feed is now obviously AI generated images with AI generated text blurbs that have nothing to do with the accompanying images likely posted by overseas bot farms.

    This is a self-inflicted problem, IMO.

    Do you just have shitty friends that share all that crap? Or are you following shitty pages?

    I use Facebook a decent amount, and I don't suffer from what you're complaining about. Your feed is made of what you make it. Unfollow the pages that make that crap. If you have friends that share it, consider unfriending or at the very least, unfollowing. Or just block the specific pages they're sharing posts from.

jijojohnxx a day ago

Sad to see wordfreq halted, it was a real party for linguistics enthusiasts. For those seeking new tools, keep expanding your knowledge with socialsignalai.

ok123456 2 days ago

Most of the "random" bot content pre-2021 was low-quality Markov-generated text. If anything, these generative AI tools would improve the accuracy of scraping large corpora of text from the web.
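For context, that kind of Markov-chain spam was trivial to produce; a minimal bigram-chain sketch (toy corpus invented for illustration):

```python
import random
from collections import defaultdict

def build_chain(words):
    """Bigram Markov chain: map each word to the list of words that followed it."""
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length, seed=0):
    """Walk the chain from a start word, picking followers at random."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "buy cheap pills online buy cheap watches online today".split()
chain = build_chain(corpus)
text = generate(chain, "buy", 5)
assert text.startswith("buy cheap")
```

Output like this is locally plausible but globally incoherent, which is why such text was comparatively easy to flag; LLM output lacks those obvious statistical tells.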

yarg a day ago

Generative AI has done to human speech analysis what atmospheric testing did to carbon dating.

jonas21 2 days ago

I think the main reason for sunsetting the project is hinted at near the bottom:

> The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money.

Traditional NLP has been surpassed by transformers, making this project obsolete. The rest of the post reads like rationalization and sour grapes.

  • rovr138 a day ago

    I think the reason to sunset the project is actually near the top.

    > I don't think anyone has reliable information about post-2021 language usage by humans.

    It's information about language usage by humans. We know generated text has exploded after 2021. How do we filter this to only have data from humans?

    The bottom is just lamenting what's happening in the field (which is pretty much what everyone that's been doing anything with NLP research is also complaining about behind closed doors).

tqi 2 days ago

"Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable."

How sure can we be about that?

iamnotsure 2 days ago

"Multi-script languages

Two of the languages we support, Serbian and Chinese, are written in multiple scripts. To avoid spurious differences in word frequencies, we automatically transliterate the characters in these languages when looking up their words.

Serbian text written in Cyrillic letters is automatically converted to Latin letters, using standard Serbian transliteration, when the requested language is sr or sh."

I'd support keeping both scripts (српска ћирилица and latin script) , similarly to hiragana (ひらがな) and katakana (カタカナ) in Japanese.
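
The transliteration wordfreq describes is essentially a character mapping. A minimal sketch of standard Serbian Cyrillic-to-Latin romanization (this is not wordfreq's actual code; note the digraphs lj, nj, dž, and that a full version would also handle uppercase):

```python
# Standard Serbian Cyrillic -> Gaj's Latin alphabet (lowercase only).
SR_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

def sr_latin(word):
    """Transliterate Serbian Cyrillic to Latin; non-Cyrillic passes through."""
    return "".join(SR_CYR_TO_LAT.get(ch, ch) for ch in word)

print(sr_latin("ћирилица"))  # → ćirilica
```

Because the mapping is lossless in this direction, folding both scripts into one frequency table avoids splitting counts for the same word, which is presumably why wordfreq does it.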

thesnide 2 days ago

I think that text on the internet will be tainted by AI the same way that steel has been tainted by nuclear devices.

andai 2 days ago

Has anyone taken a look at a random sample of web data? It's mostly crap. I was thinking of making my own search engine, knowledge database etc based on a random sample of web pages, but I found that almost all of them were drivel. Flame wars, asinine blog posts, and most of all, advertising. Forget spam, most of the legit pages are trying to sell something too!

The conclusion I arrived at was that making my own crawler actually is feasible (and given my goals, necessary!) because I'm only interested in a very, very small fraction of what's out there.

  • andai a day ago

    The unspoken question here, of course, is "you wouldn't happen to have already done this for me?" ;)

joshdavham 2 days ago

If the language you’re processing was generated by AI, it’s no longer NLP, it’s ALP.

honksillet 2 days ago

Twitter was a botnet long before LLMs and Musk got involved.

aftbit 2 days ago

Wow there is so much vitriol both in this post and in the comments here. I understand that there are many ethical and practical problems with generative AI, but when did we stop being hopeful and start seeing the darkest side of everything? Is it just that the average HN reader is now past the age where a new technological development is an exciting opportunity and on to the age where it is a threat? Remember, the Luddites were not opposed to looms, they just wanted to own them.

  • aryonoco 2 days ago

    When?

    For some of us, it was 1994, the eternal September.

    For some of us, it was when Aaron Swartz left us.

    For some of us, it was when Google killed Google Reader (in hindsight, the turning point of Google becoming evil).

    For some others, like the author of this post, it's when twitter and reddit closed their previously open APIs.

  • JohnFen 2 days ago

    > when did we stop being hopeful and start seeing the darkest side of everything?

    I think a decade or two ago, when most of the new tech being introduced (at least by our industry) started being unmistakably abusive and dehumanizing. When the recent past shows a strong trend, it's not unreasonable to expect that the near future will continue that trend. Particularly when it makes companies money.

  • slashdave 2 days ago

    Give us examples of generative AI in challenging applications (biology, medicine, physical sciences), and you'll get a lot of optimism. The text LLM stuff is the brute force application of the same class of statistical modeling. It's commercial, and boring.

anovikov 2 days ago

Sad. I'd love to see by how much the use of the word "delve" has increased since 2021...
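
The measurement itself is simple if you have token counts from a pre-2021 and a post-2021 snapshot; the hard part is getting trustworthy snapshots. A toy sketch (both corpora below are made-up stand-ins, not real data):

```python
from collections import Counter

def relative_frequency(word, tokens):
    """Occurrences of `word` per token in the corpus."""
    counts = Counter(tokens)
    return counts[word] / len(tokens)

# Toy stand-in corpora; a real measurement would need wordfreq-scale data.
pre_2021 = "scholars delve into archives while we explore the data".split()
post_2021 = "we delve into the data and delve into the results we delve".split()

before = relative_frequency("delve", pre_2021)
after = relative_frequency("delve", post_2021)
print(f"fold change: {after / before:.2f}x")
```

With real corpora, the fold change in per-token frequency is the number Philip Shapira's "delve" analysis reports.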

  • Terretta 2 days ago

    > I'd love to see by how much the use of the word "delve" has increased since 2021...

    There are charts / graphs in the link, both since 2021, and since earlier.

    The final graph suggests the phenomenon started earlier, possibly correlated in some way to Malaysian / Indian usages of English.

    It does seem OpenAI's family of GPTs as implemented in ChatGPT unspool concepts in a blend of India-based-consultancy English with American freshmen essay structure, frosted with superficially approachable or upbeat blogger prose ingratiatingly selling you something.

    Anthropic has clearly made efforts to steer this differently, Mistral and Meta as well but to a lesser degree.

    I've wondered if this reflects training material (the SEO is ruining the Internet theory), or is more simply explained by selection of pools of Hs hired for RLHF.

  • chipdart 2 days ago

    From the submission you're commenting on:

    > As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.

  • slashdave 2 days ago

    Amusing that we now have a feedback loop. Let's see... delve delve delve delve delve delve delve delve. There, I've done my part.

  • dqv 2 days ago

    Same for me but with the word “crucial”.

  • xpl 2 days ago

    The fun thing is that while GPTs initially learned from humans (because ~100% of the content was human-generated), future humans will learn from GPTs, because almost all available content will soon be GPT-generated.

    This will surely affect how we speak. It's possible that human language evolution could come to a halt, stuck in time as AI datasets stop being updated.

    In the worst case, we will see a global "model collapse" with human languages devolving along with AI's, if future AIs are trained on their own outputs...

jijojohnxx a day ago

Looks like the wordfreq party is over. Time for the next wave of knowledge tools, wonder what socialsignalai could bring to the table.

eadmund 2 days ago

> the Web at large is full of slop generated by large language models, written by no one to communicate nothing

That’s neither fair nor accurate. That slop is ultimately generated by the humans who run those models; they are attempting (perhaps poorly) to communicate something.

> two companies that I already despise

Life’s too short to go through it hating others.

> it's very likely because they are creating a plagiarism machine that will claim your words as its own

That begs the question. Plagiarism has a particular definition. It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text: i.e., duplicating exact phrases or failing to credit ideas may in some circumstances be plagiarism, but no-one is required to append a statement crediting every text he has ever read to every document he ever writes.

Credits: every document I have ever read. *grin*

  • miningape 2 days ago

    This is just the "guns don't shoot people, people do." argument except in this case we quite literally have a massive upside incentive to remove people from the process entirely (i.e. websites that automatically generate new content every day) - so I don't buy it.

    This kind of AI slop is quite literally written by no one (an algorithm pushed it out), and it doesn't communicate anything, since communication first requires some level of understanding of the source material, and LLMs are just predicting the likely next token without understanding. I would also extend this to AI slop written by someone with a limited domain understanding; they themselves have nothing new to offer, nor the expertise or experience to ensure the AI is producing valuable content.

    I would go even further and say it's "read by no one" - people are sick and tired of reading the next AI slop article on google and add stuff like "reddit" to the end of their queries to limit the amount of garbage they get.

    Sure there are people using LLMs to enhance their research, but a vast, vast majority are using it to create slop that hits a word limit.

  • slashdave 2 days ago

    > It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text

    Given that LLMs and human creativity work on fundamentally different principles, there is every reason to believe there is a difference.

  • weevil 2 days ago

    I feel like you're giving certain entities too much credit there. Yes text is generated to do _something_, but it may not be to communicate in good-faith; it could be keyword-dense gibberish designed to attract unsuspecting search engine users for click revenue, or generate political misinformation disseminated to a network of independent-looking "news" websites, or pump certain areas with so much noise and nonsense information that those spaces cannot sustain any kind of meaningful human conversation.

    The issue with generative 'AI' isn't that they generate text, it's that they can (and are) used to generate high-volume low-cost nonsense at a scale no human could ever achieve without them.

    > Life’s too short to go through it hating others

    Only when they don't deserve it. I have my doubts about Google, but I've no love for OpenAI.

    > Plagiarism has a particular definition ... no-one is required to append a statement crediting every text he has ever read

    Of course they aren't, because we rightly treat humans learning to communicate differently from training computer code to predict words in a sentence and pass it off as natural language with intent behind it. Musicians usually pay royalties to those whose songs they sample, but authors don't pay royalties to other authors whose work inspired them to construct their own stories maybe using similar concepts. There's a line there somewhere; falsely equating plagiarism and inspiration (or natural language learning in humans) misses the point.

whimsicalism 2 days ago

NLP and especially 'computational linguistics' in academia has been captured by certain political interests, this is reflective of that.

will-burner 2 days ago

> It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise.

The dependency on closed data combined with the cost of compute to do anything interesting with LLMs has made individual contributions to NLP research extremely difficult if one is not associated with a very large tech company. It's super unfortunate, makes the subject area much less approachable, and makes the people doing research in the subject area much more homogeneous.

antirez 2 days ago

Ok so post author is an AI skeptic and this is his retaliation, likely because his work is affected. I believe governments should address the problem with welfare, but being against technical advances is always being on the wrong side of history.

  • exo-pla-net 2 days ago

    This is a tech site, where >50% of us are programmers who have achieved greater productivity thanks to LLM advances.

    And yet we're filled to the gills with Luddite sentiments and AI content fearmongering.

    Imagine the hysteria and the skull-vibrating noise of the non-HN rabble when they come to understand where all of this is going. They're going to do their darndest to stop us from achieving post-economy.

    • devjab a day ago

      I think programmers are in the perfect profession to call LLMs out for just how bad they are. They are fancy auto-complete and I love them in my daily usage, but a big part of that is because I can tell when they are ridiculously wrong. Which is so often you really have to question how useful they would be for anything where they aren’t just fancy auto-complete.

      Which isn’t AIs fault. I’m sure they can be great in cancer detection, unless they replace what we’re already doing because they are cheaper than doctors. In combination with an expert AI is great, but that’s not what’s happening is it?

    • antirez 2 days ago

      I fail to see the difference. Actually, programming was one of the first fields where LLMs showed proficiency. The helper nature of LLMs is true in all fields so far; in the future this may change. I believe that, for instance, in the case of journalism the issue was already there: three euros per post written without a clue by humans.

      Anyway, in the long run AI will kill tons of jobs, regardless of blog posts like that. The true key is government assistance.

      • exo-pla-net 2 days ago

        I don't know what difference you are referring to. I was agreeing with you.

        And also agreed: many trumpet the merits of "unassisted" human output. However, they're suffering from ancestor veneration: human writing has always been a vast mine of worthless rock (slop) with a few gems of high-IQ analysis hidden here and there.

        For instance, upon the invention of the printing press, it was immediately and predominantly used for promulgating religious tracts.

        And even when you got to Newton, who created for us some valuable gems, much of his output was nevertheless deranged and worthless. [1]

        It follows that, whether we're a human or an LLM, if we achieve factual grounding and the capacity to reason, we achieve it despite the bulk of the information we ingest. Filtering out sludge is part of the required skillset for intellectual growth, and LLM slop qualitatively changes nothing.

        [1] https://www.newtonproject.ox.ac.uk/view/texts/diplomatic/THE...

floppiplopp 2 days ago

I really like the fact that the content of the conventional user content internet is becoming willfully polluted and ever more useless by the incessant influx of "ai"-garbage. At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information while the idiot rabble has to use new and expensive tools peddled by scammy tech bros to handle the stench of automated manure that flows out of stagnant llms digesting themselves.

  • JohnFen 2 days ago

    > At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information

    It's already happening. There is a growing number of groups forming their own "private internets" that are separated from the internet-at-large, precisely because the internet at large is becoming increasingly useless for a whole lot of valuable things.

  • biofox 2 days ago

    Most of the time, HN is that quiet corner. I just hope it stays that way.

shortrounddev2 2 days ago

Man the AI folks really wrecked everything. Reminds me of when those scooter companies started just dumping their scooters everywhere without asking anybody if they wanted this.

  • analog31 2 days ago

    Perhaps germane to this thread: I think the scooter thing was an investment bubble. It was easier to burn investment money on new scooters than to collect and maintain old ones, until the money ran out.

  • kdmccormick 2 days ago

    At least scooters did something useful for the environment.

    • Sander_Marechal 2 days ago

      Did they? A lot of them were barely used, got damaged or vandalized, etc. And when the companies folded or communities outlawed the scooters, they ended up as trash. I don't believe for a second that the amount of pollutants and greenhouse gases saved by usage is larger than the amount produced by manufacturing, shipping and trashing all those scooters.

    • DrillShopper 2 days ago

      Their batteries on the other hand…

      • kdmccormick 2 days ago

        Sure, they're worse than walking or biking, but compared to an electric car battery or an ICE car?

        • Sharlin 2 days ago

          At least where I'm from, scooters have mostly replaced walking and biking, not car trips :(

yard2010 a day ago

> Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X

God I hate this dystopic timeline we live in.

syngrog66 2 days ago

A few years ago I began an effort to write a new tech book. I originally planned to do as much of it as I could across a series of commits in a public GitHub repo of mine.

I then changed course. Why? I had read increasing reports of human e-book pirates (copying your book's content, then repackaging it for sale under a different title, byline, cover, and possibly at a much lower or even much higher price).

And then the rise of LLMs and their ravenous training ingest bots -- plagiarism at scale and potentially even easier to disguise.

"Not gonna happen." - Bush Sr., via Dana Carvey

Now I keep the bulk of my book material non-public during development. I'm sure I'll share a chapter candidate or so at some point before final release, for feedback and publicity. But the bulk will debut all together at once, and only once polished and behind a paywall.

cdrini a day ago

This has to be the most annoying hacker news comment section I've ever seen. It's just the same ~4 viewpoints rehashed again, and again, and again. Why don't folks just upvote other comments that say the same thing instead of repeating the same things?

And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring.

Also: as to the claims that AI will cause stagnation and a reduction of the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AIs tend to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but more in-depth explanations of what words/sections mean.

Here's to a comment that will never be read because of all the noise in this thread :/

  • appendix-rock a day ago

    They want to display how they’re truly intelligent (unlike LLMs) by *checks notes* rehashing opinions that they’ve read millions of times online.

    Sound familiar to anyone?

    • wbillingsley a day ago

      I wonder whether future generations will be ingrained with a Truman Show fear that maybe only the few thousand people they meet are real and everything else is generated background noise.

      • Cthulhu_ a day ago

        I already get this when I look at e.g. youtube comments.

  • actionfromafar a day ago

    I read it, but I can't say I like it. :-D People will ELI5 everything to understand it, no hard words necessary, up-goer-five style, then "de-compress" it into floral (Amorphophallus titanum scented) GPT speak when sending responses back out.

  • vlan121 a day ago

    You haven't read the whole thing. It says that: or that could benefit generative AI.

    • cdrini a day ago

      I did read it :) not sure how that line applies here, can you expand?

  • advael a day ago

    On a meta level I agree that having this kind of dataset with "before and after" would be pretty interesting. On an object level I do not predict that this would increase the overall diversity of language usage - and in fact it would be extremely surprising if this was even possible due to some general mathematical properties of neural networks - nor would "more professional writing," though I do agree with this characterization of the way AI-generated text sounds. The more I work with LLMs and encounter them in the wild, the greater my confidence that I can tell when something was generated, with the exception of B2B marketing copy and communications from HR departments or state agencies

    On the level of meta-discourse you seem to want to also speak to: Dang even when people have the Official Corporate Approved Perspective (in particular, the claim that it's "like being able to chat with someone that has an infinite vocabulary" is probably the silliest delusional AI hype I've heard all week) and the most upvotes in the thread they still think they're an embattled ideological minority. Starting to think that literally zero people in the modern world don't have or affect a victim complex of some kind

    • cdrini a day ago

      Haha I'm pleasantly surprised to see my comment at the top, I genuinely thought it would drown to the bottom! Not due to disagreement, just due to sheer volume and being posted rather late in this post's lifespan. Anyways my meta comment wasn't that I disagreed with all the other comments, I was just frustrated at how repetitive they were of one another. When I go to leave a comment, I do a pass reading through all or most of the comments to make sure someone hasn't left a comment in the same vein, and it was just frustrating to go through people saying almost verbatim the same thing others were saying! If your comment isn't adding something new, why leave it? I'm all for healthy disagreement :) Also not sure what part of my post sounds like it's from an "embattled ideological minority".

      But speaking of healthy disagreement, as to "chatting with someone that has an infinite vocabulary", I'd love to hear any counterarguments you might have; or was calling it "silly and delusional" meant to be your argument? :P I think it's a pretty uncontroversial statement seeing as eg ChatGPT very likely knows every word in the English language.

      • advael a day ago

        The most ridiculous aspects for me were the anthropomorphizing (Reminds me of that one Sam Altman interview a bit) and the use of "infinite", which both doesn't really work on vibes (as many have noted, while I'm sure chatGPT has been exposed to every word, its pattern of communication is very "regression to the mean" among them), but also is silly if taken literally, because unless we're counting like some quirky technically-grammatical combinatoric compounding that we in practice infer the meaning of from composition of what we identify as separate individual words (like just hyphenating a bunch of adjectives and a noun or something) there's not really an argument for there being "infinite vocabulary" in the same sense that there is for "infinite possible sentences" because being a valid word requires at least that someone can meaningfully comprehend what is meant by it, and coordination requirements of this nature tend to truncate infinities

        The case for ChatGPT doing significant coinage that sticks isn't particularly strong either, partially from theory and partially because I'd think I'd've heard a lot of complaints about it by now, and the ones on hackernews would be repetitive to the point of seeming unavoidable (we agree on that for sure)

        Anyway, re: the silliest hype I've heard all week, I'm mostly just trying to find humor in what has been a pretty bad hype wave for someone who's pathologically bad at sounding like the kind of nontechnical hype guys that pervade any tech hype wave but is nonetheless mostly seeking out jobs in this field because it's what my expertise is in. Incredibly awful job market for a lot of people I realize, but it feels like a special hell I get for getting into ML research before it was (quite so) cool. I'm trying to fight the negativity but I've gotten screwed over a lot lately, but I don't have anything against you personally for being silly on hn

        • cdrini an hour ago

          Ah ok so anthropomorphizing and the phrase "infinite vocabulary" sounds impossible. I agree infinite vocabulary is a bit murky, and mathematically incorrect. If I wanted to be more mathematically correct I could say complete vocabulary, but I think that's actually a little less understandable to people. I did not mean infinite vocabulary in that it coins new words, just infinite as in very large to the point of being incomprehensibly large by a single individual. As per anthropomorphizing, I think the word "chat" is the most anthropomorphizing I did, so don't agree with you on that one.

          Ah mate sorry to hear that, the market is tough right now. I will say objectively I believe there's very little in my comment that's hype-y. I think using AI while reading documents out of your comfort zone, and asking it questions can expand your vocabulary. I've personally tried it, it's helped me read papers not in my field, it's helped me find papers for better research. I can understand how someone can disagree with that, but calling it hype sounds to me more like a response to an invisible enemy/to "all the ones who hyped before" than to an actual concrete response to this specific case. And I think that mentality could put you in a potential catch-22 mental loop that will leave you constantly dissatisfied with anything AI or ML, by constantly seeing this invisible enemy where it might not be present. Anyways, stay positive and best of luck with the job hunt!

          Edit: and it looks like my comment has now fallen deep into the depths of the comment thread, never to be heard from again! See, I told you I was an embattled ideological minority ;)

      • mark-r a day ago

        Sure, ChatGPT knows every word in the English language (and probably quite a few that ain't). But how likely is it to use them all?

        • cdrini 41 minutes ago

          Now that's an argument! Agreed, it won't use them of its own accord, but the fact that you can ask it about words, or ask it even to break down important words in a new field, or give it a paragraph from a paper not in your field and have it explain the jargon, I think that's how it can help someone grow their vocabulary.

  • BiteCode_dev a day ago

    Also it breaks the language barrier: you can now read the Chinese internet if you want, or chat transparently in Arabic. That's going to be interesting.

    • Cthulhu_ a day ago

      At the moment though (and ever since decent online translation services were a thing), it feels one-way, that is, people from that side of the internet coming to the anglosphere internet moreso than anglosphere people going internet-abroad. I may be wrong.

      • BiteCode_dev a day ago

        As a Frenchman, I learned very quickly that my language sphere market and resource pool is so much smaller than the English one that it's 10 times less effective to do anything in it.

        I understand the position.

        The only exception would be China, but the GFW is probably not helping.

        LLMs might lower the cost of that so much that it will become more interesting to do so.

  • [removed] a day ago
    [deleted]
adr1an a day ago

I guess curating unpolluted text is one of the new jobs GenAI created? /s

[removed] a day ago
[deleted]
next_xibalba 2 days ago

[flagged]

  • Ensorceled 2 days ago

    > Also, it is shocking how authoritarian the “left” has become in my lifetime.

    We are going through a general uptick in authoritarian "discussions" online. It's interesting that you are only seeing it on the "left".

    • Lerc 2 days ago

      Perhaps they only find the increase shocking on the left.

      There's a bunch of ways to measure political opinions. The Authoritarian-liberal one being one of many. The economic-left and the economic-right are becoming more separated from the social-left and the social-right.

      Tribalism also causes people to take on the positions of their 'tribe' which may be distinct to what their own personalities might normally gravitate to.

      In the past, it has been the economic-left and social-right that were more prone to authoritarianism with their proponents believing that their ideals should be enforced.

      The economic-right and social-left was more of a logic vs empathy tension ('this works' vs 'this is right') and a lot of people seemed to reconcile the two for one flavour of centrism.

      To me it is a little shocking how authoritarian elements of the social-left have become, an ideology that has long been characterized by empathy and supporting others seems to have become blended with opinions which are exclusionary or dogmatic, which seem counter to their own principles.

      In some respects maybe this is just the march of time making the progressive opinions of one generation the orthodoxy of the next and these people are just finding a new conservatism rooted in a new orthodoxy.

      • Eisenstein 2 days ago

        Isn't that just what happens when you keep pushing the Overton Window to the right? What would have been 'centrists' have to become more authoritarian to stand ground or else they let their position get absorbed by the stronger leaning side. When one side refuses to compromise even slightly, you have two options: give in or dig in.

    • Diti 2 days ago

      As far as I noticed, the “right” effectively gets the boot in most online communities which abide to a Code of Conduct, leaving mostly the “left” (the most recent example I have in mind of such moderation efforts is the save-nix-together.org open letter). It’s interesting that you don’t notice this happening in the communities you seem to frequent.

      • Ensorceled 2 days ago

        No, I see a lot of the "right" discourse. Many are openly supporting Putin now. I follow many conservative (US) pundits and journalists and they have either taken a hard right turn or are raising the alarm and supporting Harris. I see similar trends here in Canada.

        Yes, I see that the left has become more authoritarian, but it pales to the hard shift I see on the right.

    • acheong08 2 days ago

      They said nothing about it being “only” on the left.

      I somewhat expect authoritarianism on the right and therefore would hold the left (to which I belong) at a higher standard.

      • Ensorceled 2 days ago

        Authoritarianism on the right is becoming its mainstream, while authoritarianism on the left is merely on the rise.

    • robertlagrant 2 days ago

      > It's interesting that you are only seeing it on the "left".

      One explanation is that now things have switched round, and people with left-wing beliefs, sometimes extremely left-wing beliefs, control a lot of institutions and structures. People who are my age (too old) grew up with the right being in that position, but I don't think that's a contemporary instinct to possess.

  • Miraltar 2 days ago

    It did feel emotive, but that wasn't the main point. Data is harder to get (or more expensive) and more polluted.

    • thomasfromcdnjs 2 days ago

      Felt super emotive to me. The problems the author is outlining a) might not be actual problems and b) just require new thinking to solve.

      • albedoa 2 days ago

        The problems are well-known and highly-documented. You should leave the determination of (b) up to those who know and understand (a), which includes the author.

    • algaeselect 2 days ago

      It does poison the article when someone is talking about subject X and then feels the need to insert their political opinion about person Y, or that they don't like that Z is left-wing/right-wing. It makes them seem not at all objective, and calls into doubt what else they are not being objective about in their article.

  • Jcampuzano2 2 days ago

    Even if it's true that this was an emotional decision, so much of Twitter/X itself is now AI slop anyway that it'd be worth it to just not include it, whether it was right-wing or not.

    Regardless, the owner is well within his right to make an emotional decision based on his beliefs to stop anyway.

  • weweweoo 2 days ago

    Yeah, I'm no fan of Musk or Trump, but I think Twitter always was a spam-infested, hateful cesspool where people with online-addiction yelled at each other. There was nothing for Musk to ruin, because the whole concept was rotten from the start. Allowing only short messages doesn't promote intelligent discussion, it does the opposite.

  • hhh 2 days ago

    I don't care about the political part, but Twitter used to have nice, high-enough-quality expert discussion around topics, and now it's just a shithole with very stupid takes polluting the few spots left free in the replies between engagement farmers and LLM slop-spewers.

jaimex2 2 days ago

[flagged]

  • x3ro 2 days ago

    Do you mind explicitly saying what views and what "mainstream media" you are referencing here?

  • [removed] 2 days ago
    [deleted]
  • hluska 2 days ago

    I don’t know how, but you manage to consistently flood this site with garbage content. And while there is a lot of it, your content is poor enough that I remember you.

    You’re not edgy dude. Give it up.

hoseja 2 days ago

>"Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X. Even if X made its raw data feed available (which it doesn't), there would be no valuable information to be found there.

>Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.

>And given what's happening to the field, I don't blame them."

What beautiful doublethink.

  • mschuster91 2 days ago

    > What beautiful doublethink.

    Given just how many AI bots scrape up everything they can, oftentimes ignoring robots.txt or any rate limits (there have been a few complaint threads on HN about that), I can hardly blame the operators of large online services just cutting off data feeds.

    Twitter however didn't stop their data feeds due to AI or because they wanted money, they stopped providing them because its new owner does everything he can to hinder researchers specializing in propaganda campaigns or public scrutiny.

    • hluska 2 days ago

      What was Reddit’s excuse? They did roughly the same thing (and have just as much garbage content).

      In other words, why is it wrong for X but okay for Reddit? If you ignore one individual’s politics, the two services did the same thing.

      • mschuster91 2 days ago

        Reddit shut their API access down only very recently, after the AI craze went off. Twitter did so right after Musk took over, way before Reddit, way before AI ever went nuts.

        • dotnet00 2 days ago

          X shut down API access in Feb 2023, Reddit shut theirs down at the end of June of the same year. Just barely 6 months apart.

          Furthermore, while X had also only announced this in February, Reddit announced their API shutdown just 2 months later in April.

          And, to further add to that, X was pretty upfront that they think they have access to a large and powerful dataset in X and didn't want to give it out for free. Reddit used very similar wording when announcing their changes.

QRe 2 days ago

I understand the frustration shared in this post but I wholeheartedly disagree with the overall sentiment that comes with it.

The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill anything.

The world is chaotic and net entropy (degree of disorder) of any isolated or closed system will always increase. Same goes for the web. We just have to embrace it and overcome the challenges that come with it.

  • ryukoposting 2 days ago

    I'm not so optimistic. The most basic requirements are:

    1. Prove the human-ness of an author...

    2. ...without grossly encroaching on their privacy.

    3. Ensure that the author isn't passing off AI-generated material as their own.

    We'll leave out the "don't let AI models train on my data" part for now.

    Whatever solution we come up with, if any, will necessarily be mired in the politics of privacy, anonymity, and/or DRM. In any case, it's hard to conceive of a world where the human web returns as we once knew it.

    • vundercind a day ago

      The good news—such as it is—is that the Web never really became what we assumed it surely would in its early days.

      If it was never really the case that, for serious or self-improving reading, you'd have been better off with only the Web than with only access to a decent library, then we haven't lost something so precious.

      I mean, the most valuable site on the Web is probably a book & research paper piracy website. That’s its crowning achievement. Faster interlibrary loan, basically, but illegal.

  • brunokim 2 days ago

    Here is an expert saying there is a problem and explaining how it killed their research effort, and yet you say that things are the same as ever and nothing was killed.

    • QRe a day ago

      1. I am not discrediting the expert in any way. If anything, I think their decision to quit is understandable: a challenge arose during their research that is not in their interest to pursue (information pollution is not research in corpus linguistics / NLP).

      2. I never said that things are the same as ever; quite the opposite, actually. I am saying the world evolves constantly. It's naive to say company X/Y/Z killed something or made something unusable when there is constant, inevitable change. We should focus on how to move forward given this constraint, and not dwell on times when the web was so much 'cleaner', 'nicer', more manageable, etc.