weinzierl 2 days ago

Isn't it the other way around?

SEO text carefully tuned to tf-idf metrics and keyword-stuffed to the empirically determined threshold Google just allows should have unnatural word frequencies.
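
(For reference, tf-idf scores a term's prominence in one document, discounted by how common the term is across all documents. A minimal sketch, illustrative only; real SEO tooling is far more elaborate:)

    import math

    def tf_idf(term, doc, corpus):
        tf = doc.count(term) / len(doc)           # term frequency in this doc
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        return tf * math.log(len(corpus) / df)    # rare terms weigh more

    docs = [["best", "vpn", "vpn", "deal"],
            ["delve", "into", "vpn"],
            ["cat", "photos"]]
    print(tf_idf("vpn", docs[0], docs))           # ~0.20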

LLM content should just enhance and cement the status quo word frequencies.

Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.

mlsu 2 days ago

But you can already see it with "delve". Mistral uses "delve" more than baseline, because it was trained on GPT output.

So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...
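
(A toy model of that loop, with invented numbers, shows how a modest per-generation bias compounds before settling at an amplified level:)

    base_rate = 0.0001     # human baseline frequency of "delve"
    model_bias = 3.0       # model emits the word 3x as often as its training data
    synthetic_share = 0.3  # fraction of each new corpus that is model output

    rate = base_rate
    for generation in range(5):
        model_rate = rate * model_bias
        # the next training corpus mixes human text with model output
        rate = (1 - synthetic_share) * base_rate + synthetic_share * model_rate
        print(generation, round(rate / base_rate, 2))  # amplification vs. baseline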

Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).

Unfortunately, I think human and machine-generated text are entirely miscible. There is no "baseline" outside the machines, other than pre-2022 text. Like pre-atomic steel.

  • bryanrasmussen a day ago

    Is the use of "miscible" here a clue? Or just some workplace vocabulary you've adapted analogically?

    • mlsu a day ago

      Human me just thought it was a good word for this. It implies an irreversible process of mixing, which I think characterizes this process really well.

      • noduerme a day ago

        There were dozens of 20th Century ideological movements which developed their own forms of "Newspeak" in their own native languages. Largely, natural human dialog between native speakers and between those opposed to the prevailing regime recoils violently at stilted, official, or just "uncool" usages in daily vernacular. So I wouldn't be too surprised to see a sharp downtick in the popular use of any word that becomes subject to an LLM's positive-feedback loop.

        Far from saying the pool of language is now polluted, I think we now have a great data set with which to begin to discern authentic from inauthentic human language. Although sure, people on the fringes could get caught in a false positive for being bots, like you or me.

        The biggest LLM of them all is the daily driver of all new linguistic innovation: human society, in all its daily interactions. The quintillions of phrases exchanged and forever mutating around the globe - each mutation interacting with its interlocutor, and each drawing not on the last 500,000 tokens but on the entire multi-modal, if you will, lifetime experience of each human - vastly eclipse anything any hardware could emulate under current energy constraints.

        Software LLMs are just state machines stuck in a moment in time. At best they will always lag, the way Stalinist language lagged years behind the patois of average Russians, who invented daily linguistic dodges to subvert and mock the regime. The same process takes place wherever there is a dominant official or uncool accent or phrasing: the ghetto invents new words, new rhythm, and then it becomes cool in the middle class. The authorities never catch up, precisely because subversive language is humanity's immune system against authority.

        If there is one distinctly human trait, it's sniffing out anyone who sounds suspiciously inauthentic. (Sadly, it's also the trait that leads to every kind of conspiracy theorizing imaginable; but even this probably confers an evolutionary advantage in some cases.) Sniffing out the sound of a few LLMs is already happening, and it will accelerate geometrically, much faster than new models can be trained.

        • bryanrasmussen a day ago

          Humans also lag humans; the future may already be spoken, but the slang is not evenly memed out yet.

    • jazzyjackson a day ago

      If you think that's niche wait til you hear about man-machine miscegenation

  • taneq 2 days ago

    > LLM uses delve more, delve appears in training data more, LLM uses delve more...

    Some day we may view this as the beginnings of machine culture.

    • mlsu 2 days ago

      Oh no, it's been here for quite a while. Our culture is already heavily glued to the machine. The way we express ourselves, the language we use, even our very self-conception originates increasingly in online spaces.

      Have you ever seen someone use their smartphone? They're not "here," they are "there." Forming themselves in cyberspace -- or being formed, by the machine.

derefr 2 days ago

1. People don't generally use the (big, whole-web-corpus-trained) general-purpose LLM base models to generate bot slop for the web. Paying per API call to generate that kind of stuff would be far too expensive; it'd be like paying for eStamps to send spam email. Spambot developers use smaller open-source models, trained on much smaller corpora, sized and quantized to generate text that's "just good enough" to pass muster. This creates a sampling bias in the word-associational "knowledge" the model is working from when generating. (A sketch of such a setup follows point 3.)

2. Given how LLMs work, a prompt is a bias — they're one and the same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors to write the "character card" describing a character in the style you want the character to speak in, for exactly this reason; an example card follows point 3.) Whatever prompt the developer uses is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.

3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing and turn them into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce further unintentional bias, given that the SEO-optimized training dataset is likely not an otherwise representative sample of writing style for the target language.) A sketch of this data-prep step follows below as well.
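
A minimal sketch of the point-1 setup, using llama-cpp-python with a 4-bit-quantized model. The model file name and the prompt are placeholders; any small instruction-tuned .gguf model would do:

    from llama_cpp import Llama

    # Small, quantized, self-hosted model: no per-call API cost,
    # "just good enough" output. The model file here is hypothetical.
    llm = Llama(model_path="small-model.Q4_K_M.gguf", n_ctx=2048)

    out = llm("Write a short, upbeat review of a budget VPN service:",
              max_tokens=300)
    print(out["choices"][0]["text"])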
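
For point 2, an invented example of the character-card advice: the card is deliberately written in the voice the bot should use, since the card itself becomes part of the prompt and therefore part of the bias:

    # Hypothetical character card, written in-voice on purpose.
    character_card = """
    Name: Mags
    Voice: Talks fast, drops her g's, never uses a ten-dollar word.
    Sample: "Look, I ain't sayin' it's a bad plan. I'm sayin' it's YOUR plan."
    """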
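
And for point 3, a sketch of how keyword-stuffed pages might be packaged as fine-tuning data. The JSONL prompt/completion layout is a common fine-tuning convention; the file name and topics are made up:

    import json

    seo_posts = [
        {"topic": "best budget laptops",
         "body": "Looking for the best budget laptops? Our best budget laptops guide ..."},
        # ... more scraped, keyword-stuffed posts ...
    ]

    # One prompt/completion pair per post; fine-tuning on this teaches
    # a model to produce the SEO style by default.
    with open("seo_finetune.jsonl", "w") as f:
        for post in seo_posts:
            f.write(json.dumps({
                "prompt": f"Write a blog post about {post['topic']}.",
                "completion": post["body"],
            }) + "\n")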

  • travisjungroth a day ago

    On point 1, that's surprising to me. A 2,000-word blog post would cost about 10 cents with GPT-4o. So you put out 1,000 of them, which is a lot, for $100.
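
    Back-of-envelope, under assumed numbers (about 1.3 tokens per English word, output priced around $10 per 1M tokens; actual GPT-4o rates have varied over time), that estimate is the right order of magnitude:

        words = 2000
        tokens = words * 1.3               # ~2,600 output tokens per post
        usd_per_token = 10 / 1_000_000     # assumed $10 per 1M output tokens
        per_post = tokens * usd_per_token  # ~$0.026; a dime is a safe upper bound
        print(f"${per_post:.3f}/post, ${per_post * 1000:.0f} per 1,000 posts")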

    • derefr 20 hours ago

      There are two costs associated with using a hosted inference platform: the OpEx of API calls, and the CapEx of setting up an account in the first place. This second cost is usually trivial, as it just requires things any regular person already has: an SSO account, a phone number for KYC, etc.

      But, insofar as your use-case is against the TOUs of the big proprietary inference platforms, this second cost quickly swamps the first cost. They keep banning you, and you keep having to buy new dark-web credentials to come back.

      Given this, it's a lot cheaper and more reliable — you might summarize these as "more predictable costs" — to design a system around a substrate whose "immune system" won't constantly be trying to kill the system. Which means either your own hardware, or a "bring your own model" inference platform like RunPod/Vast/etc.

      (Now consider that there are a bunch of fly-by-night BYO-model hosted inference platforms that are charging unsustainable flat-rate subscription prices for use of their hardware. Why do these exist? Should be obvious now, given the facts already laid out: these are people doing TOU-violating things who decided to build their own cluster for doing them… and then realized that they had spare capacity on that cluster that they could sell.)

      • travisjungroth 18 hours ago

        This makes sense. But now I'm wondering if people here are speaking from experience or reasoning their way into it. Like, are there direct reports of which models people are using for blogspam, or is it just what seems rational?

    • brazzy a day ago

      But then you'll be competing for clicks with others who put out 1,000,000 posts at lower cost because they used a small, self-hosted model.

      • baq a day ago

        If you are a sales & marketing intern with a potato laptop and a $100 budget to spend on SEO, you aren't going to be self-hosting anything, even if you know what that means.

        • nerdponx a day ago

          This is about high-volume blog/news-spam created specifically to serve ads and affiliate links, not about occasional content marketing for legitimate companies.

lbhdc 2 days ago

> LLM content should just enhance and cement the status quo word frequencies.

TFA mentions this hasn't been the case.

  • flakiness 2 days ago

    Would you mind dropping the link talking about this point? (context: I'm a total outsider and have no idea what TFA is.)

    • girvo 2 days ago

      TFA means "the featured article", so in this case the "Why wordfreq will not be updated" link we're talking about.

      • adastra22 2 days ago

        To be pedantic, the F in TFA has the same meaning as the F in RTFM.

        It's the same origin. On Slashdot (the HN of the early '00s) people would admonish others to RTFA. Then they started using it as a referent: TFA was the thing you were supposed to have read.

      • jnordwick a day ago

        The Fucking Article, from RTFA (Read the Fucking Article), along the lines of RTFM (Read the Fucking Manual/Manpage).