voytec 2 days ago

I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, repeated keywords, and a focus on "indexability" over readability made the web a less than ideal source for this kind of analysis long before LLMs.

It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

doe_eyes 2 days ago

> I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, repeated keywords, and a focus on "indexability" over readability made the web a less than ideal source for this kind of analysis long before LLMs.

Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but that's true of most text. A textbook on carburetor maintenance is going to use the word "carburetor" way above the baseline rate. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.

In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.
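
For a sense of how mechanically simple that measurement is, here's a minimal word-frequency sketch in Python (the tiny corpus and tokenizer are stand-ins, not wordfreq's actual pipeline):

  # Count relative word frequencies across a pile of human-written documents.
  from collections import Counter
  import re

  docs = [
      "A textbook on carburetor maintenance mentions the carburetor a lot.",
      "News articles and blog posts balance the sample back out.",
  ]

  counts = Counter()
  for doc in docs:
      counts.update(re.findall(r"[a-z']+", doc.lower()))

  total = sum(counts.values())
  freqs = {word: n / total for word, n in counts.items()}
  print(freqs["carburetor"])  # high in this tiny sample, averaged down in a varied mix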

  • weinzierl 2 days ago

    Isn't it the other way around?

    SEO text carefully tuned to tf-idf metrics and keyword-stuffed to the empirically determined threshold Google still allows should have unnatural word frequencies.

    LLM content should just reinforce and cement the status-quo word frequencies.

    Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
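
    To make the tf-idf point concrete, here's a hand-rolled sketch (my toy scoring, not Google's): the score rewards a term that is dense within your page but rare across the corpus, so each stuffed repetition pushes it up, and the stuffer's job is to stop just under whatever threshold the ranker tolerates.

      # Toy tf-idf over pre-tokenized documents (word lists).
      import math

      def tf_idf(term, doc, corpus):
          tf = doc.count(term) / len(doc)           # term density within the page
          df = sum(1 for d in corpus if term in d)  # number of pages using the term
          return tf * math.log(len(corpus) / df)

      corpus = [
          "clean your carburetor carburetor carburetor today".split(),
          "a post about sewing patterns".split(),
          "a post about brownie recipes".split(),
      ]
      print(tf_idf("carburetor", corpus[0], corpus))  # rises with each repetition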

    • mlsu 2 days ago

      But you can already see it with "delve". Mistral uses "delve" more than baseline because it was trained on GPT output.

      So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...
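
      You can even simulate the loop with made-up numbers (every rate below is an assumption, not a measurement):

        # Each generation's training corpus mixes human text with the previous
        # model's output; any word the model overuses gets amplified.
        human_rate = 0.001     # assumed "delve" frequency in human text
        model_bias = 1.5       # assumed overshoot relative to training data
        synthetic_share = 0.3  # assumed fraction of corpus that is model output

        rate = human_rate
        for gen in range(5):
            model_rate = rate * model_bias
            rate = (1 - synthetic_share) * human_rate + synthetic_share * model_rate
            print(f"gen {gen}: corpus rate {rate:.6f}")
        # Settles above the human baseline; if synthetic_share * model_bias
        # ever exceeds 1, it runs away instead of settling.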

      Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).

      Unfortunately, I think human- and machine-generated text are entirely miscible. There is no "baseline" outside the machines, other than pre-2022 text. Like pre-atomic steel.

      • bryanrasmussen a day ago

        is the use of miscible here a clue? Or just some workplace vocabulary you've adapted analogically?

      • taneq 2 days ago

        > LLM uses delve more, delve appears in training data more, LLM uses delve more...

        Some day we may view this as the beginnings of machine culture.

    • derefr 2 days ago

      1. People don't generally use the (big, whole-web-corpus-trained) general-purpose LLM base-models to generate bot slop for the web. Paying per API call to generate that kind of stuff would be far too expensive; it'd be like paying for eStamps to send spam email. Spambot developers use smaller open-source models, trained on much smaller corpuses, sized and quantized to generate text that's "just good enough" to pass muster. This creates a sampling bias in the word-associational "knowledge" the model is working from when generating.

      2. Given how LLMs work, a prompt is a bias — they're one and the same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors to write the "character card" describing a character in the style you want the character to speak in, for exactly this reason.) Whatever prompt the developer uses is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.

      3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn them into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce further unintentional bias, given that the SEO-optimized training dataset is likely not an otherwise-representative sampling of writing style for the target language.)

      • travisjungroth a day ago

        On point 1, that’s surprising to me. A 2,000-word blog post would be 10 cents with GPT-4o. So you put out 1,000 of them, which is a lot, for $100.
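
        Back-of-envelope version (the tokens-per-word ratio and price are assumptions, and API prices move around):

          words = 2_000
          tokens = words * 1.33        # rough tokens-per-word for English
          usd_per_1m_out = 15.00       # assumed output price, $/1M tokens
          per_post = tokens / 1e6 * usd_per_1m_out
          print(f"${per_post:.3f} per post, ${per_post * 1_000:.0f} per 1,000 posts")
          # ~$0.04 per post, so 10 cents is a comfortable upper bound.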

    • tigerlily a day ago

        Too deep we delved, and awoke the ancient delves.

    • lbhdc 2 days ago

      > LLM content should just reinforce and cement the status-quo word frequencies.

      TFA mentions this hasn't been the case.

      • flakiness 2 days ago

        Would you mind dropping the link talking about this point? (context: I'm a total outsider and have no idea what TFA is.)

bondarchuk 2 days ago

At some point, though, you have to acknowledge that a specific use of language belongs to the medium through which you're counting word frequencies. There are also specific writing styles (including sentence/paragraph sizes, unnecessary repetitions, focusing on metrics other than readability) associated with newspapers, novels, e-mails to your boss, anything really. As long as a text was written by a human who was counting on at least some remote possibility that another human might read it, this is a far more legitimate use of language than just generating it with a machine.

kevindamm 2 days ago

Yes, but not quite as far as you imply. The training data is weighted by a quality metric: articles written by journalists and Wikipedia contributors are given more weight than Aunt May's brownie recipe and corpoblogspam.

  • jsheard 2 days ago

    > The training data is weighted by a quality metric

    At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit, like those YouTube channels which post a new "product review" every 10 minutes, with an AI-generated thumbnail and an AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and which is of course always a glowing recommendation, since the point is to get the viewer to click an affiliate link.

    Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?

    • acdha 2 days ago

      > Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?

      Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue, and those sites were loaded with ads. As long as advertisers didn’t stop buying, they didn’t feel much pressure to make big changes.

      A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.

    • derefr 2 days ago

      > those YouTube channels which post a new "product review" every 10 minutes, with an AI-generated thumbnail and an AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet

      The problem is that, of the signals you mention,

      • the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")

      • and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.

      Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.

      And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.

      Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)

    • nneonneo a day ago

      Reminds me of a Google search I did yesterday: “Hezbollah” yields a little info box with headings “Overview”, “History”, “Apps” and “Return policy”.

      I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two headings, but who knows. Maybe some AI video out there did a product review of Hezbollah.

    • Suppafly 2 days ago

      >At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight.

      I've noticed that lately. It used to be that the top Google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.

    • epgui 2 days ago

      I don’t think they were talking about the quality of Google search results. I believe they were talking about how the data was processed by the wordfreq project.

      • kevindamm 2 days ago

        I was actually referring to the data ingestion for training LLMs, I don't know what filtering or weighting might be done with wordfreq.

    • noirscape 2 days ago

      Google has those problems because the company's revenue source (Ads) and the thing that puts it on the map (Search) are fundamentally at odds with one another.

      A useful Search would ideally send a user to the site with the most signal and the least noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate with the subject of the page.

      Up until ~5 years ago, Google was able to keep these two in balance: you'd get results with some ads, but the signal generally outweighed the noise. Unfortunately, from what I can tell from anecdotes and courtroom documents, somewhere in 2018-2019 the Ad team essentially hijacked every other aspect of the company by threatening that yearly bonuses wouldn't be paid out unless everyone kowtowed to its wish to optimize ad revenue, and there's no sign of this stopping since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing though, and Kagi is only used by tech enthusiasts. The problem with copying Google is that you need a ton of computing resources upfront, and you're going up against a company with infinitely more money and the ability to keep users inside its ecosystem; go ahead and abandon Search, but good luck convincing people to give up, say, their Gmail account, which keeps them locked to Google, and Search will be there, enticing the average user.)

      Google has absolutely zero incentive to filter generative AI junk out of its search results, beyond whatever amount is damaging its PR, since most of the SEO spam is also running Google Ads (unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to reduce it just enough that a user won't get the same type of AI junk twice.

      • PaulHoule 2 days ago

        My understanding is that Google Ads are what makes Google Search unassailable.

        A search engine isn't a two-sided market in itself, but the ad network that supports it is. A better search engine is a technological problem; a decently paying ad network is both a technological problem and a hard marketing problem.

  • Freak_NL 2 days ago

    It certainly feels like the amount of regurgitated, nonsensical, generated content (nontent?) has risen spectacularly specifically in the past few years. 2021 sounds about right based on just my own experience, even though I can't point to any objective source backing that up.

    • eszed 2 days ago

      Upvoted for "nontent" alone: it'll be my go-to term from now on, and I hope it catches on.

      Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.

    • zharknado 2 days ago

      Ooh I like “nontent.” Nothing like a spicy portmanteau!

    • eptcyka 2 days ago

      I personally have yet to see this beyond some slop on YouTube, and I am here for the AI meme videos. I recognize the dangers; all I am saying is that I don't feel the effect yet.

      • Freak_NL 2 days ago

        I'm seeing it a lot when searching for advice in a well-defined subject, like, say, leatherworking or sewing (or recipes, obviously). Instead of finding forums with hobbyists, in-depth blog posts, or manufacturers' advice pages, I increasingly find articles which seem like natural language at first but are composed of paragraphs and headers repeating platitudes and basic tips. It takes a few seconds to realize the site is just pushing generated articles.

        Increasingly I find that for in-depth explanations or tutorials, YouTube is the only place to go, but even there the search results can lead to loads of videos which just seem… off. But at least those are still made by humans.

      • ghaff 2 days ago

        There's been a ton of low-rent listicle writing out there for ages; it's certainly not new in the past few years. I admit I don't go on YouTube much and don't even have a TikTok account, so it's possible there's a lot of newer lousy content I'm not really exposed to.

        It seems to me that the fact it's so cheap and relatively easy for people with dreams of becoming wealthy influencers to put stuff out there has more to do with the flood of often mediocre content than AI does.

        Of course, the vast majority don't have much real success and get on with life, the crank turns, and a new generation perpetuates the cycle.

        LLMs etc. may make things marginally easier, but there's no shortage of twenty-somethings with lots of time imagining riches while making pennies.

      • sharpshadow a day ago

        Looking forward to watching perfect generated videos. We need so much more power and so many more chips, but it’s completely worth it. After that? Maybe generated video games. But the video stuff will be awesome, changing video-dominated social media content forever. Virtual headsets will finally become useful, generating anything you want to see and letting you jump through space and time.

    • jsheard 2 days ago

      SEO grifters have fully integrated AI at this point; there are dozens of turn-key "solutions" for mass-producing "content" with the absolute minimum effort possible. It's been refined to the point that scraping material from other sites, running it through the LLM blender to make it look original, and publishing it on a platform like WordPress is fully automated end-to-end.

      • sahmeepee 2 days ago

        Or check out "money printer" on GitHub: a tongue-in-cheek mashup of various tools that takes a keyword as input and produces a YouTube video with subtitles and narration as output.

  • darby_nine 2 days ago

    Aunt May's brownie recipe (or at least her thoughts on it) is likely something you'd want if you want to reflect how humans use language. Both news-style and encyclopedia-style writing represent a pretty narrow slice.

    • creshal 2 days ago

      That's why search engines rated them highly, and why a million spam sites cropped up that paid writers $1/essay to pretend to be Aunt May, and why today every recipe website has a gigantic useless fake essay in front of its copypasted, made-up recipes.

      • Freak_NL 2 days ago

        I hate how looking for recipes has become so… disheartening. Online recipes are fine for reputable sources like newspapers where professional recipe writers are paid for their contributions, but searching for some Aunt May's recipe for 'X' in the big ocean of the internet is pointless — too much raw sewage dumped in.

        It sucks, because sharing recipes seemed like one of those things the internet could be really good at.

      • darby_nine 2 days ago

        OK, but what I said is true regardless of SEO, and SEO had also fed back into English before LLMs were a thing. If you only train on those subsets, you'll also end up with a chatbot that doesn't speak in a way we'd identify as natural English.

  • Lalabadie 2 days ago

    The current state of things leads me to believe that Google's current ranking system has been somehow too transparent for the last 2-3 years.

    The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.

sahmeepee 2 days ago

Prior to Google we had AltaVista, and in those days it was incredibly common to find keywords spammed hundreds of times in white text on a white background in the footer of a page. SEO spam is not new; it's just different.

rockskon a day ago

Don't forget Google's AdSense rules, which penalized useful, straightforward websites and mandated that websites be full of "content". It doesn't matter if the "content" is garbage, nonsensical rambling with excessive word use - it's content, and much more likely to be okayed by AdSense!

ToucanLoucan 2 days ago

This feels like a second, orders-of-magnitude-larger Eternal September. I wonder how much more of this the Internet can take before everyone just abandons it entirely. My usage is notably lower than it was in even 2018; it's so goddamn hard to find anything worth reading anymore (which is why I spend so much damn time here, tbh).

  • wpietri 2 days ago

    I think it's an arms race, but it's an open question who wins.

    For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).

    • jsheard 2 days ago

      The fight against spam email also led to mass consolidation of what was supposed to be a decentralised system though. Monoliths like Google and Microsoft now act as de-facto gatekeepers who decide whether or not you're allowed to send emails, and there's little to no transparency or recourse to their decisions.

      There's probably an analogy to be made about the open decentralised internet in the age of AI here, if it gets to the point that search engines have to assume all sites are spam by default until proven otherwise, much like how an email server is assumed guilty until proven innocent.

    • jerf 2 days ago

      Another problem with this arms race is that spam emails actually are largely separable from ham emails for most people... or at least they were, for most of their run. The thousandth email that claims the UN has set aside money for me due to my non-existent African noble ancestry that they can't find anyone to give it to and I just need to send the Thailand embassy some money to start processing my multi-million yuan payout and send it to my choice of proxy in Colombia to pick it up is quite different from technical conversation about some GitHub issue I'm subscribed to, on all sorts of metrics.
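
      (For anyone who hasn't built one: even a crude hand-rolled word-likelihood score, naive-Bayes style, separates that kind of spam from ham. The per-class word probabilities below are invented for illustration; a real filter learns them from labeled mail.)

        import math

        # Invented word probabilities per class; a real filter estimates these.
        spam_words = {"payout": 0.02, "embassy": 0.01, "millions": 0.03}
        ham_words  = {"github": 0.02, "issue": 0.03, "patch": 0.01}
        FLOOR = 1e-4  # probability floor for words unseen in a class

        def log_odds(text):
            score = 0.0
            for w in text.lower().split():
                score += math.log(spam_words.get(w, FLOOR) / ham_words.get(w, FLOOR))
            return score  # positive leans spam, negative leans ham

        print(log_odds("the UN payout awaits at the embassy"))  # strongly positive
        print(log_odds("comment on the github issue patch"))    # strongly negative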

      However, the frontline of the email war has shifted lately. Now the most important part of the war is being fought over emails that look just like ham, but aren't. Business frauds where someone convinces you that they are the CEO or CFO or some VP and they need you to urgently buy this or that for them right now no time to talk is big business right now, and before you get too high-and-mighty about how immune you are to that, they are now extremely good at looking official. This war has not been won yet, and to a large degree, isn't something you necessarily win by AI either.

      I think there's an analogy here to the war on content slop. Since all content slop wants is for you to see it so it can serve you ads, it doesn't need anything our algorithms could trip on, like links to malware or calls to action to be defrauded. It looks just like the real stuff, and telling that it isn't could require a human to sift through rather vast amounts of input just to be mostly sure. And we don't have the ability to authenticate where it came from. (There is no content-authentication solution that will work at scale. No matter how you try to get humans to "sign their work", people will always work out how to automate it, and then it's done.) So the one good and solid signal that helps in email is gone for general web content.

      I don't judge this as a winning scenario for the defenders here. It's not a total victory for the attackers either, but I'd hesitate to even call an advantage for one side or the other. Fighting AI slop is not going to be easy.

    • ToucanLoucan 2 days ago

      > but spammers mostly lost that arms race

      I'm not saying this is impossible, but that's going to be an uphill sell for me as a concept. According to some quick stats I checked, I'm getting roughly 600 emails per day, about 550 of which go directly to spam filtering, and of the remaining 50, I'd say about 6 are actually emails I want to be receiving. That's an impressive catch rate for whoever built this particular filter, but it's also still a ton of chaff to sort wheat from, and as a result I don't use email much for anything apart from when I have to.

      Like, I guess that's technically usable, I'm much happier filtering 44 emails than 594 emails? But that's like saying I solved the problem of a flat tire by installing a wooden cart wheel.

      It's also worth noting that if I do have an email that's flagged as spam when it shouldn't be, I then have to wade through a much deeper pond of shit to go find it. So again: better, but IMO not even remotely solved.

      • dhosek 2 days ago

        I’m not sure what you’ve done to get that level of spam, but I get about 10 spam emails a day at most, and that’s across multiple accounts, including one that I’ve used for almost 30 years and had used on Usenet, which was the uber-spam magnet. A couple of newer (10–15 year old) addresses which I’ve published on webpages with mailto links attract maybe one message a week, and one that I keep for a specialized purpose (fiction and poetry submissions) gets maybe one to two messages per year, mostly because it’s of the form example@example.com and so easily guessed by enterprising spammers.

        Looking at the last day’s spam¹, I have three 419-style scams (widows wanting to give away their dead husbands’ grand pianos or multi-million-euro estates) and three phishing attempts. There are duplicate messages in each category.

        About fifteen years ago, I did a purge of mailing list subscriptions, and there’s very little that comes in that I don’t want. The most notable exception is a writer who’s a nice guy, but who interpreted my question about a comment he made on a podcast as an invitation to be added to his manually managed email list; given that it’s only four or five messages a year, I guess I can live with that.

        1. I cleaned out spam yesterday while checking for a confirmation message from a purchase.

      • wpietri 2 days ago

        I'm having a hard time finding reliably sourced statistics here, but I suspect you're an outlier. My personal numbers are way better, both on Gmail and Fastmail, despite using the same email addresses for decades.

    • pyrale 2 days ago

      > but spammers mostly lost that arms race.

      Advertising in your mails isn't Google's.

  • BeFlatXIII 2 days ago

    I hope this trend accelerates to force us all into grass-touching and book-reading. The sooner, the better.

    • MrLeap 2 days ago

      Books printed before 2018, right?

      I already find myself mentally filtering out audible releases after a certain date unless they're from an author I recognize.

redbell 2 days ago

> ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

Following the process above, the third iteration is naturally LLMs writing for corporate bots, neither for humans nor for other LLMs.

pphysch 2 days ago

It's crazy to attribute the downfall of the web/search to Google. What does Google have to do with all the genuine open web content, Google's source of wealth, getting starved by (increasingly) walled gardens like Facebook, Reddit, Discord?

I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.

krelian 2 days ago

>And yet LLMs were still fed articles written for Googlebot, not humans.

How do we know what content LLMs were fed? Isn't that a highly guarded secret?

Won't the quality of the content be paramount to the quality of the generated output, or does it not work that way?

  • GTP 2 days ago

    We do know that the open web constitutes the bulk of the training data, although we don't get to know which specific webpages were used. Add some more curated sources, like books, of which again we only know that they are books but not which ones. So it's just a matter of probability that a good amount of SEO spam was in there as well.