Why wordfreq will not be updated

1707 points by tomthe 10 months ago

voytec 10 months ago

I agree in general but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions and focus on "indexability" instead of readability, made the web a less than ideal source for such analysis long before LLMs.

It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

Reply View 98 replies

doe_eyes 10 months ago

> I agree in general but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions and focus on "indexability" instead of readability, made the web a less than ideal source for such analysis long before LLMs.
Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.
In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.

Reply View | 32 replies
- weinzierl 10 months ago
  
  Isn't it the other way around?
  SEO text carefully tuned to tf-idf metrics and keyword stuffed to them empirically determined threshold Google just allows should have unnatural word frequencies.
  LLM content should just enhance and cement the status quo word frequencies.
  Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
  
  Reply View | 29 replies
  
  mlsu 10 months ago
  
  But you can already see it with Delve. Mistral uses "delve" more than baseline, because it was trained on GPT.
  So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...
  Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).
  Unfortunately, I think human and machine generated text are entirely miscible. There is no "baseline" outside the machines, other than from pre-2022 text. Like pre-atomic steel.
  
  Reply View | 13 replies
  
  derefr 10 months ago
  
  1. People don't generally use the (big, whole-web-corpus-trained) general-purpose LLM base-models to generate bot slop for the web. Paying per API call to generate that kind of stuff would be far too expensive; it'd be like paying for eStamps to send spam email. Spambot developers use smaller open-source models, trained on much smaller corpuses, sized and quantized to generate text that's "just good enough" to pass muster. This creates a sampling bias in the word-associational "knowledge" the model is working from when generating.
  2. Given how LLMs work, a prompt is a bias — they're one-and-the-same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors, to write the "character card" describing a character, in the style that you want the character speaking in, for exactly this reason.) Whatever prompt the developer uses, is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.
  3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn it into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce unintentional further bias, given that the SEO-optimized training dataset likely is not an otherwise-perfect representative sampling of writing style for the target language.)
  
  Reply View | 6 replies
  
  lbhdc 10 months ago
  
  > LLM content should just enhance and cement the status quo word frequencies.
  TFA mentions this hasn't been the case.
  
  Reply View | 6 replies
  
  tigerlily 10 months ago
  
  Too deep we delved, and awoke the ancient delves.
  
  Reply View | 0 replies
- brudgers 10 months ago
  
  serpent eating its own tail
  GOGI.
  
  Reply View | 1 reply
  
  romwell 10 months ago
  
  The Inhuman Centipede
  
  Reply View | 0 replies
bondarchuk 10 months ago

At some point though you have to acknowledge that a specific use of language belongs to the medium through which you're counting word frequencies. There are also specific writing styles (including sentence/paragraph sizes, unnecessary repetitions, focusing on other metrics than readability) associated with newspapers, novels, e-mails to your boss, anything really. As long as text was written by a human who was counting on at least some remote possibility that another human might read it, this is way more legitimate use of language than just generating it with a machine.

Reply View | 0 replies
ToucanLoucan 10 months ago

This feels like a second, magnitudes larger Eternal September. I wonder how much more of this the Internet can take before everyone just abandons it entirely. My usage is notably lower than it was in even 2018, it's so goddamn hard to find anything worth reading anymore (which is why I spend so much damn time here, tbh).

Reply View | 10 replies
- wpietri 10 months ago
  
  I think it's an arms race, but it's an open question who wins.
  For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).
  
  Reply View | 6 replies
  
  jsheard 10 months ago
  
  The fight against spam email also led to mass consolidation of what was supposed to be a decentralised system though. Monoliths like Google and Microsoft now act as de-facto gatekeepers who decide whether or not you're allowed to send emails, and there's little to no transparency or recourse to their decisions.
  There's probably an analogy to be made about the open decentralised internet in the age of AI here, if it gets to the point that search engines have to assume all sites are spam by default until proven otherwise, much like how an email server is assumed guilty until proven innocent.
  
  Reply View | 0 replies
  
  jerf 10 months ago
  
  Another problem with this arms race is that spam emails actually are largely separable from ham emails for most people... or at least they were, for most of their run. The thousandth email that claims the UN has set aside money for me due to my non-existent African noble ancestry that they can't find anyone to give it to and I just need to send the Thailand embassy some money to start processing my multi-million yuan payout and send it to my choice of proxy in Colombia to pick it up is quite different from technical conversation about some GitHub issue I'm subscribed to, on all sorts of metrics.
  However, the frontline of the email war has shifted lately. Now the most important part of the war is being fought over emails that look just like ham, but aren't. Business frauds where someone convinces you that they are the CEO or CFO or some VP and they need you to urgently buy this or that for them right now no time to talk is big business right now, and before you get too high-and-mighty about how immune you are to that, they are now extremely good at looking official. This war has not been won yet, and to a large degree, isn't something you necessarily win by AI either.
  I think there's an analogy here to the war on content slop. Since what the content slop wants is just for you to see it so they can serve you ads, it doesn't need anything else that our algorithms could trip on, like links to malware or calls to action to be defrauded, or anything else. It looks just like the real stuff, and telling that it isn't could require a human rather vast amounts of input just to be mostly sure. Except we don't have the ability to authenticate where it came from. (There is no content authentication solution that will work at scale. No matter how you try to get humans to "sign their work" people will always work out how to automate it and then it's done.) So the one good and solid signal that helps in email is gone for general web content.
  I don't judge this as a winning scenario for the defenders here. It's not a total victory for the attackers either, but I'd hesitate to even call an advantage for one side or the other. Fighting AI slop is not going to be easy.
  
  Reply View | 0 replies
  
  ToucanLoucan 10 months ago
  
  > but spammers mostly lost that arms race
  I'm not saying this is impossible but that's going to be an uphill sell for me as a concept. According to some quick stats I checked I'm getting roughly 600 emails per day, about 550 of which go directly to spam filtering, and of the remaining 50, I'd say about 6 are actually emails I want to be receiving. That's an impressive amount overall for whoever built this particular filter, but it's also still a ton of chaff to sort wheat from and as a result I don't use email much for anything apart from when I have to.
  Like, I guess that's technically usable, I'm much happier filtering 44 emails than 594 emails? But that's like saying I solved the problem of a flat tire by installing a wooden cart wheel.
  It's also worth noting there that if I do have an email thats flagged as spam that shouldn't be, I then have to wade through a much deeper pond of shit to go find it as well. So again, better, but IMO not even remotely solved.
  
  Reply View | 2 replies
  
  pyrale 10 months ago
  
  > but spammers mostly lost that arms race.
  Advertising in your mails isn't Google's.
  
  Reply View | 0 replies
- [removed] 10 months ago
  
  [deleted]
  
  Reply View | 0 replies
- BeFlatXIII 10 months ago
  
  I hope this trend accelerates to force us all into grass-touching and book-reading. The sooner, the better.
  
  Reply View | 1 reply
  
  MrLeap 10 months ago
  
  Books printed before 2018, right?
  I already find myself mentally filtering out audible releases after a certain date unless they're from an author I recognize.
  
  Reply View | 0 replies
kevindamm 10 months ago

Yes but not quite as far as you imply. The training data is weighted by a quality metric, articles written by journalists and wikipedia contributors are given more weight than Aunt May's brownie recipe and corpoblogspam.

Reply View | 44 replies
- jsheard 10 months ago
  
  > The training data is weighted by a quality metric
  At least in Googles case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation since the point is to get the viewer to click an affiliate link.
  Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?
  
  Reply View | 10 replies
  
  acdha 10 months ago
  
  > Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?
  Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue and and those sites were loaded with ads. As long as advertisers didn’t stop buying, they didn’t feel much pressure to make big changes.
  A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.
  
  Reply View | 0 replies
  
  derefr 10 months ago
  
  > those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet
  The problem is that, of the signals you mention,
  • the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")
  • and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.
  Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.
  And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.
  Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)
  
  Reply View | 0 replies
  
  nneonneo 10 months ago
  
  Reminds me of a Google search I did yesterday: “Hezbollah” yields a little info box with headings “Overview”, “History”, “Apps” and “Return policy”.
  I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two tabs, but who knows. Maybe some AI video out there did a product review of Hezbollah.
  
  Reply View | 1 reply
  
  selestify 10 months ago
  
  Wow, you’re not kidding. The “return policy” info box officially links to https://www.reuters.com/world/middle-east/dozens-hezbollah-m...
  
  Reply View | 0 replies
  
  Suppafly 10 months ago
  
  >At least in Googles case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight.
  I've noticed that lately. It used to be the top google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.
  
  Reply View | 0 replies
  
  epgui 10 months ago
  
  I don’t think they were talking about the quality of Google search results. I believe they were talking about how the data was processed by the wordfreq project.
  
  Reply View | 1 reply
  
  kevindamm 10 months ago
  
  I was actually referring to the data ingestion for training LLMs, I don't know what filtering or weighting might be done with wordfreq.
  
  Reply View | 0 replies
  
  noirscape 10 months ago
  
  Google has those problems because the company's revenue source (Ads) and the thing that puts it on the map (Search) are fundamentally at odds with one another.
  A useful Search would ideally send a user to the site with the most signal and the fewest noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate to the subject of a page.
  Up until ~5 years ago, Google was able to strike a balance on keeping these two stable; you'd get results with some Ads but the signal generally outweighed the noise. Unfortunately from what I can tell from anecdotes and courtroom documents, the Ad team at Google has essentially hijacked every other aspect of the company by threatening that yearly bonuses won't be given out if they don't kowtow to the Ad teams wishes to optimize ad revenue somewhere in 2018-2019 and has no sign of stopping since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing though and Kagi is only used by tech enthusiasts. The problem with Google is that to copy it, you need a ton of computing resources upfront and are going up against a company with infinitely more money and ability to ensure users don't leave their ecosystem; go ahead and abandon Search, but good luck convincing others to give up say, their Gmail account, which keeps them locked to Google and Search will be there, enticing the average user.)
  Google has absolutely zero incentive to filter out generative AI junk from their search results outside the amount of it that's damaging their PR since most of the SEO spam is also running Google Ads (since unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to instead reduce it enough to the degree where a user will not get the same type of AI junk twice.
  
  Reply View | 1 reply
  
  PaulHoule 10 months ago
  
  My understanding is that Google Ads are what makes Google Search unassailable.
  A search engine isn't a two-sided market in itself but the ad network that supports it is. A better search engine is a technological problem, but a decently paying ad network is a technological problem and a hard marketing problem.
  
  Reply View | 0 replies
  
  inquirerGeneral 10 months ago
  
  [dead]
  
  Reply View | 0 replies
- Freak_NL 10 months ago
  
  It certainly feels like the amount of regurgitated, nonsensical, generated content (nontent?) has risen spectacularly specifically in the past few years. 2021 sounds about right based on just my own experience, even though I can't point to any objective source backing that up.
  
  Reply View | 9 replies
  
  eszed 10 months ago
  
  Upvoted for "nontent" alone: it'll be my go-to term from now on, and I hope it catches on.
  Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.
  
  Reply View | 1 reply
  
  Freak_NL 10 months ago
  
  I do hope it catches on! I did come up with this myself, but I really doubt I'm the only one — and indeed: Wiktionary lists it already with a 2023 vintage:
  https://en.wiktionary.org/wiki/nontent
  
  Reply View | 0 replies
  
  zharknado 10 months ago
  
  Ooh I like “nontent.” Nothing like a spicy portmanteau!
  
  Reply View | 0 replies
  
  eptcyka 10 months ago
  
  I personally am yet to see this beyond some slop on youtube. And I am here for the AI meme videos. I recognize the dangers of this, all I am saying is that I don't feel the effect, yet.
  
  Reply View | 3 replies
  
  jsheard 10 months ago
  
  SEO grifters have fully integrated AI at this point, there are dozens of turn-key "solutions" for mass-producing "content" with the absolute minimum effort possible. It's been refined to the point that scraping material from other sites, running it through the LLM blender to make it look original, and publishing it on a platform like Wordpress is fully automated end-to-end.
  
  Reply View | 1 reply
  
  sahmeepee 10 months ago
  
  Or check out "money printer" on github: a tongue in cheek mashup of various tools to take a keyword as input and produce a youtube video with subtitles and narration as output.
  
  Reply View | 0 replies
- darby_nine 10 months ago
  
  Aunt may's brownie recipe (or at least her thoughts on it) are likely something you'd want if you want to reflect how humans use language. Both news-style and encyclopedia-style writing represent a pretty narrow slice.
  
  Reply View | 21 replies
  
  creshal 10 months ago
  
  That's why search engines rated them highly, and why a million spam sites cropped up that paid writers $1/essay to pretend to be Aunt May, and why today every recipe website has a gigantic useless fake essay in front of their copypasted made up recipes.
  
  Reply View | 20 replies
- Lalabadie 10 months ago
  
  The current state of things leads me to believe that Google's current ranking system has been somehow too transparent for the last 2-3 years.
  The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.
  
  Reply View | 0 replies
sahmeepee 10 months ago

Prior to Google we had Altavista and in those days it was incredibly common to find keywords spammed hundreds of times in white text on a white background in the footer of a page. SEO spam is not new, it's just different.

Reply View | 0 replies
rockskon 10 months ago

Don't forget Google's adsense rules which penalized useful straightforward websites and mandated websites be full of "content". Doesn't matter if the "content" is garbage nonsense rambling and excessive word use - it's content and much more likely to be okayed by adsense!

Reply View | 0 replies
pphysch 10 months ago

It's crazy to attribute the downfall of the web/search to Google. What does Google have to do with all the genuine open web content, Google's source of wealth, getting starved by (increasingly) walled gardens like Facebook, Reddit, Discord?
I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.

Reply View | 0 replies
redbell 10 months ago

> ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
Based on the process above, naturally, the third iteration then is LLMs writing for corporate bots, neither for humans nor for other LLMs.

Reply View | 0 replies
rgrieselhuber 10 months ago

Indexability is orthogonal to readability.

Reply View | 1 reply
- hk__2 10 months ago
  
  It should be, but sadly it’s not.
  
  Reply View | 0 replies
krelian 10 months ago

>And yet LLMs were still fed articles written for Googlebot, not humans.
How do we know what content LLMs were fed? Isn't that a highly guarded secret?
Won't the quality of the content be paramount to the quality of the generated output or does it not work that way?

Reply View | 1 reply
- GTP 10 months ago
  
  We do know that the open web consitutes the bulk of the trainig data, although we don't get to know the specific webpages that got used. Plus some more selected sources, like books, of which again we only know that those are books but not which books were used. So it's just a matter of probability that there was a good amount of SEO spam as well.
  
  Reply View | 0 replies

jgrahamc 10 months ago

I created https://lowbackgroundsteel.ai/ in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.

Reply View 44 replies

LeoPanthera 10 months ago
Congratulations on "shipping", I've had a background task to create pretty much exactly this site for a while. What is your cutoff date? I made this handy list, in research for mine:
2017: Invention of transformer architecture June 2018: GPT-1 February 2019: GPT-2 June 2020: GPT-3 March 2022: GPT-3.5 November 2022: ChatGPT
You may want to add kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.
Reply View | 1 reply
- jgrahamc 10 months ago
  
  I was taking "Release of ChatGPT" as the Trinity date.
  
  Reply View | 0 replies
VyseofArcadia 10 months ago

Clever name. I like the analogy.

Reply View | 27 replies
- freilanzer 10 months ago
  
  I don't seem to get it.
  
  Reply View | 26 replies
  
  ziddoap 10 months ago
  
  Steel without nuclear contamination is sought after, and only available from pre-war / pre-atomic sources.
  The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.
  https://en.wikipedia.org/wiki/Low-background_steel
  >Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4]
  
  Reply View | 11 replies
  
  AlphaAndOmega0 10 months ago
  
  It's a reference to the practise of scavenging steel from sources that were produced before nuclear testing began, as any steel produced afterwards is contaminated with nuclear isotopes from the fallout. Mostly ship wrecks, and WW2 means there are plenty of those. The pun in question is that his project tries to source text that hasn't been contaminated with AI generated material.
  https://en.m.wikipedia.org/wiki/Low-background_steel
  
  Reply View | 0 replies
  
  ms512 10 months ago
  
  After the detonation of the first nuclear weapons, any newly produced steel has a low dose of nuclear fallout.
  For applications that need to avoid the background radiation (like physics research), pre atomic age steel is extracted, like from old shipwrecks.
  https://en.m.wikipedia.org/wiki/Low-background_steel
  
  Reply View | 0 replies
  
  GreenWatermelon 10 months ago
  
  From the blog
  > Low Background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945.
  
  Reply View | 0 replies
  
  voytec 10 months ago
  
  To whomever downvoted parent: please don't act against people brave enough to state that they don't know something.
  This is a desired quality, increasingly less present in IT work environments. People afraid of being shamed for stating knowledge gaps are not the folks you want to work with.
  
  Reply View | 6 replies
  
  KeplerBoy 10 months ago
  
  Steel made before atmospheric tests of nuclear bombs were a thing is referred to as low background steel and invaluable for some applications.
  LLMs pollute the internet like atomic bombs polluted the environment.
  
  Reply View | 0 replies
  
  cdman 10 months ago
  
  https://en.wikipedia.org/wiki/Low-background_steel
  
  Reply View | 0 replies
  
  [removed] 10 months ago
  
  [deleted]
  
  Reply View | 0 replies
  
  [removed] 10 months ago
  
  [deleted]
  
  Reply View | 0 replies
astennumero 10 months ago

That's exactly the opposite of what the author wanted IMO. The author no more wants to be a part of this mess. Aggregating these sources would just makes it so much more easier for the tech giants to scrape more data.

Reply View | 2 replies
- rovr138 10 months ago
  
  The sources are just aggregated. The source doesn't change.
  The new stuff generated does (and this is honestly already captured).
  This author doesn't generate content. They analyze data from humans. That "from humans" is the part that can't be discerned enough and thus the project can't continue.
  Their research and projects are great.
  
  Reply View | 0 replies
- iak8god 10 months ago
  
  The main concerns expressed in Robyn's note, as I read them, seem to be 1) generative AI has polluted the web with text that was not written by humans, and so it is no longer feasible to produce reliable word frequency data that reflects how humans use natural language; and 2) simultaneously, sources of natural language text that were previously accessible to researchers are now less accessible because the owners of that content don't want it used by others to create AI models without their permission. A third concern seems to be that support for and practice of any other NLP approaches is vanishing.
  Making resources like wordfreq more visible won't exacerbate any of these concerns.
  
  Reply View | 0 replies
[removed] 10 months ago

[deleted]

Reply View | 0 replies
Der_Einzige 10 months ago

FYI: My two datasets, DebateSum and OpenDebateEvidence/OpenCaseList in their current forms qualify for this, as they end at latest in 2022.

Reply View | 1 reply
- jgrahamc 10 months ago
  
  You can either add them to the site yourself via Tumblr or send them to me via email (jgc@cloudflare).
  
  Reply View | 0 replies
imhoguy 10 months ago

I am not sure we should trust a site contaminated by AI graphics. /s

Reply View | 4 replies
- gorkish 10 months ago
  
  The buildings and shipping containers that store low background steel aren't built out of the stuff either.
  
  Reply View | 0 replies
- whywhywhywhy 10 months ago
  
  Yeah pay an illustrator if this is important to you.
  See a lot of people upset about AI still using AI image generation because it's not in their field so they feel less strongly about it and can't create art themselves anyway, hypocritical either use it or don't but don't fuss over it then use it for something thats convenient for you.
  
  Reply View | 2 replies
  
  imhoguy 10 months ago
  
  I have updated my comment with "/s" as that is closer to what I've meant. However, seriously, from ethical point of view it is unlikely illustrators were asked or compensated for their work being used for training AI to produce the image.
  
  Reply View | 1 reply
  
  heckelson 10 months ago
  
  I thought the header image was a symbol of AI slop contamination because it looked really off-putting
  
  Reply View | 0 replies
ClassyJacket 10 months ago

:'( I thought I was clever for realising this parallel myself! Guess it's more obvious than I thought.
Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.

Reply View | 2 replies
- thebruce87m 10 months ago
  
  I too realised this parallel and frequently tell people about it.
  Edit: just the first one
  
  Reply View | 0 replies
- sweeter 10 months ago
  
  [flagged]
  
  Reply View | 0 replies

jll29 10 months ago

I regret the situation led to the OP feel discourage about the NLP community, wo which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past even?).

The complaint about pollution of the Web with artificial content is timely, and it's not even the first time due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").

Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.

When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses from away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good will prevail in the end.

Reply View 44 replies

solardev 10 months ago

Have "good" small webs EVER prevailed?
Every content system seems to get polluted by noise once it hits mainstream usage: IRC, Usenet, reddit, Facebook, geocities, Yahoo, webrings, etc. Once-small curated selections eventually grow big enough to become victims of their own successes and taken over by spam.
It's always an arms race of quality vs quantity, and eventually the curators can't keep up with the sheer volume anymore.

Reply View | 31 replies
- squigz 10 months ago
  
  > Have "good" small webs EVER prevailed?
  You ask on HN, one of the highest quality sites I've ever visited in any age of the Internet.
  IRC is still alive and well among pretty much the same audience as always. I'm not sure it's fair to compare that with the others.
  
  Reply View | 25 replies
  
  solardev 10 months ago
  
  Well, niche forums are kinda different when they manage to stay small and niche. Not just HN but car forums, LED forums, etc.
  But if they ever include other topics, they risk becoming more mainstream and noisy. Even within adjacent fields (like the various Stacks) it gets pretty bad.
  Maybe the trick is to stay within a single small sphere then and not become a general purpose discussion site? And to have a low enough volume of submissions where good moderation is still possible? (Thank you dang and HN staff)
  
  Reply View | 3 replies
  
  bongodongobob 10 months ago
  
  It's high quality when the content is within HN's bubble. Anything related to health, politics, or Microsoft is full of misinformation, ignorance, and garbage like any other site. The Microsoft discussions in particular are extremely low quality.
  
  Reply View | 20 replies
- htrp 10 months ago
  
  Any curation mechanism that depends on passion and/or the goodwill of volunteers is unsustainable.
  
  Reply View | 0 replies
- 38 10 months ago
  
  its so easy to solve this problem, not sure why anyone hasnt done it yet.
  1. build a userbase, free product
  2. once userbase get big enough, any new account requires a monthly fee, maybe $1
  3. keep raising the fee higher and higher, until you get to the point that the userbase is manageable.
  no ads, simple.
  
  Reply View | 3 replies
  
  jachee 10 months ago
  
  Until N ad views are worth more than $X account creation fee. Then the spammers will just sell ad posts for $X*1.5.
  I can’t find it, but there’s someone selling sock puppet posts on HN even.
  
  Reply View | 0 replies
  
  abridges6523 10 months ago
  
  This sounds like a good idea. I do wonder if enough people would sign up for it to be a worthy venture because I think the main issue with ads is I think once you add any price at all dramatically reduces participation even if it’s not about cost some people just see the payment and immediately disengage.!
  
  Reply View | 0 replies
  
  [removed] 10 months ago
  
  [deleted]
  
  Reply View | 0 replies
squigz 10 months ago

> people consuming auto-generated content to keep the masses from away from critical thinking. This is now happening
The people who stay away from critical thinking were doing that already and will continue to do so, 'AI' content or not.

Reply View | 8 replies
- psychoslave 10 months ago
  
  I don't know, individually finely tuned addictive content served as real time interactive feedback loops is an other level of propaganda and attention capture tool than largest common denominator of the general crowd served as static passive content.
  
  Reply View | 1 reply
  
  squigz 10 months ago
  
  Perhaps, but the solution is the same either way, and it isn't trying to ban technology or halt progress or just sit and cry about how society is broken. It's educating each other and our children on the way these things work, how to break out of them, and how we might more responsibly use the technology.
  
  Reply View | 0 replies
- trehalose 10 months ago
  
  How did they get started?
  
  Reply View | 5 replies
  
  squigz 10 months ago
  
  They likely never started critically thinking, so they never had to get started on not doing so.
  (If children are never taught to think critically, then...)
  
  Reply View | 4 replies
Llamamoe 10 months ago

> Good will prevail in the end.
Even if, this is a dangerous thought that discourages decisive action that is likely to be necessary for this to happen.

Reply View | 0 replies
sweeter 10 months ago

tangentially related, but Marx also predicted that crypto and NFT's would exist in 1894 [1] and I only bring it up because its kind of wild how we keep crossing these "red lines" without even blinking. It's like that meme:
Sci-fi author:
I created the Torment Nexus to serve as a cautionary tale...
Tech Company:
Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"
1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm

Reply View | 0 replies
Intralexical 10 months ago

What if the way for good to prevail is to reject technologies and beliefs that have become destructive?

Reply View | 0 replies

0xbadcafebee 10 months ago

I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.

As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.

Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.

Reply View 24 replies

Vegenoid 10 months ago

On Amazon, you used to be able to search the reviews and Q&A section via a search box. This was immensely useful. Now, that search box first routes your search to an LLM, which makes you wait 10-15 seconds while it searches for you. Then it presents its unhelpful summary, saying "some reviews said such and such", and I can finally click the button to show me the actual reviews and questions with the term I searched.
This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to to a direct search, please tell me.

Reply View | 3 replies
- cosmotron 10 months ago
  
  You can still get to product reviews directly and search them. Here's an example:
  Product page (copy the identifier at the end): https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy-...
  Review page (paste the identifier at the end): https://www.amazon.com/product-reviews/B001OF5F1E/
  This seems to bypass all of the LLM stuff for now.
  
  Reply View | 1 reply
  
  Vegenoid 10 months ago
  
  Pretty good! Unfortunately it does not include the Q&As, which are often just as useful as the reviews.
  
  Reply View | 0 replies
- graeme 10 months ago
  
  Ran into this the other day. Amazon.ca still has the old version for now
  
  Reply View | 0 replies
bbarn 10 months ago

No disagreement for the most part.
I used to be able to say search for Trek bike derailleur hanger and the first result would be what I wanted. Now I have to scroll past 5 ads to buy a new bike, one that's a broken link to a third party, and if I'm really lucky, at the bottom of page 1 will be the link to that part's page.
The shitification of the web is real.

Reply View | 2 replies
- klyrs 10 months ago
  
  R.I.P. Sheldon Brown T_T
  (The Agner Fog of cycling?)
  
  Reply View | 1 reply
  
  bbarn 10 months ago
  
  He was a legend.
  
  Reply View | 0 replies
Gethsemane 10 months ago

Sounds like your laptop is wholly out of date, you need to buy the next generation of laptops on Amazon that can handle the modern SEO load. I recommend the:
LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business
Name rolls off the tongue doesn’t it

Reply View | 1 reply
- tim333 10 months ago
  
  Or a Macbook.
  
  Reply View | 0 replies
cedric_h 10 months ago

There is a startup whose product is better search. The killer feature is that you pay for it, so you aren't the produdct. https://kagi.com/welcome

Reply View | 1 reply
- codezero 10 months ago
  
  Can vouch for this. It’s the first non-Google search alternative I’ve used that has 100% replaced Google. I don’t need Google as a fallback like I did with others.
  
  Reply View | 0 replies
akkartik 10 months ago

I've been slowly detaching myself from the web for the past 10 years. These days I mostly build offline apps using native technologies. Those capabilities are still around. They just receded for a while because they'd gotten so polluted with toolbars and malware. But now the malware is on the other side, and native apps are cool again. If you know where to look. Here's my shingle: https://akkartik.name/freewheeling-apps
On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.

Reply View | 0 replies
w10-1 10 months ago

> If I can in any way purchase something without the web, I'mma do that
To get to the milk you'll have to walk by 3 rows of chips and soda.

Reply View | 5 replies
- odo1242 10 months ago
  
  Yeah, this is why I still use the web to order things in a nutshell lol
  
  Reply View | 4 replies
  
  0xbadcafebee 10 months ago
  
  Where do you order things online that you aren't inundated by ads?
  
  Reply View | 3 replies
matrix87 10 months ago

> Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.
Hey, who cares about making services that work when we can give people a cool chatbot assistant and a 1800 number with no real-person alternative to the decision tree

Reply View | 0 replies
gazook89 10 months ago

The web is much more than a shopping site.

Reply View | 1 reply
- yifanl 10 months ago
  
  It is, but the SEO spammers who ruined the web want it to be shopping mall, and they can't even do a particularly good job at being one.
  
  Reply View | 0 replies
nlpparty 10 months ago

I suppose it is just Amazon problems. I have never lived in the area where Amazon is prevalent. Where I live, search engines still can't find synonyms or process misspellings.

Reply View | 0 replies
BeetleB 10 months ago

If search is your metric, the web was dead long before OpenAI's release of GPT. I gave up on web search a long time ago.

Reply View | 0 replies
kristopolous 10 months ago

for tech stuff I just use documentation, bug trackers and source code now. Web searching has become useless.

Reply View | 0 replies

weinzierl 10 months ago

"I don't think anyone has reliable information about post-2021 language usage by humans."

We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.

Especially smaller children don't have a good intuition on what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident to answer but I get less and less confident every day.

The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.

Reply View 39 replies

frognumber 10 months ago

There are a series of challenges like:
https://www.nytimes.com/interactive/2024/09/09/technology/ai...
https://www.nytimes.com/interactive/2024/01/19/technology/ar...
These are a little bit unfair, in that we're comparing handpicked examples, but I don't think many experts will pass a test like this. Technology only moves forward (and seemingly, at an accelerating pace).
What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homosapiens are around 300,000 years old. Cities, agriculture, and civilization is around 10,000. Metal is around 4000. Industrial revolution is 500. Democracy? 200. Computation? 50-100.
The revolutions shorten in time, seemingly exponentially.
Comparing the world of today to that of my childhood....
One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
I'm not quite sure where this is all headed.

Reply View | 9 replies
- homebrewer 10 months ago
  
  > so much stuff is basically free
  It really isn't. Have a look at daily median income statistics for the rest of the planet:
  https://ourworldindata.org/grapher/daily-median-income?tab=t...
  $2.48 Eastern and Southern Africa (PIP) $2.78 Sub-Saharan Africa (PIP) $3.22 Western and Central Africa (PIP) $3.72 India (rural) $4.22 South Asia (PIP) $4.60 India (urban) $5.40 Indonesia (rural) $6.54 Indonesia (urban) $7.50 Middle East and North Africa (PIP) $8.05 China (rural) $10.00 East Asia and Pacific (PIP) $11.60 Latin America and the Caribbean (PIP) $12.52 China (urban)
  And more generally:
  $7.75 World
  I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income of at least half the population of our planet.
  
  Reply View | 0 replies
- knodi123 10 months ago
  
  +100w chargers are one of the products I prefer to spend a little more on, so I get something from a company that knows it can be sued if they make a product that burns down your house or fries your phone.
  Flashlights? Sure, bring on aliexpress. USB cables with pop-off magnetically attached heads, no problem. But power supplies? Welp, to each their own!
  
  Reply View | 2 replies
  
  fph 10 months ago
  
  And then you plug your cheap pop-off USB cable into the expensive 100w charger?
  
  Reply View | 1 reply
  
  knodi123 10 months ago
  
  Yeah, sure, what could possibly go wrong? :-P
  But seriously, it's harder to accidentally make a USB cable that fries your equipment. The more common failure mode is it fails to work, or wears out too fast. Chargers on the other hand, handle a lot of voltage, generate a lot of heat, and output to sensitive equipment. More room to mess up, and more room for mistakes to cause damage.
  
  Reply View | 0 replies
- bee_rider 10 months ago
  
  > One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.
  Is there a big recent qualitative change here? Or is this a continuation of manufacturing trends (also shocking, not trying to minimize it all, just curious if there’s some new manufacturing tech I wasn’t aware of).
  For some reason, your comment got me thinking of a fully automated system, like: you go to a website, pick and choose charger capabilities (ports, does it have a battery, that sort of stuff). Then an automated factor makes you a bespoke device (software picks an appropriate shell, regulators, etc). I bet we’ll see it in our lifetimes at least.
  
  Reply View | 0 replies
- csomar 10 months ago
  
  Democracy (and Republics) are thousands of year old. Computation is also quite old though it only sky-rocketed with electricity and semiconductors. This is not the first time the global world created a potential for exponential growth (I'll consider the Pharaohs and Roman empires to be ones).
  There is the very real possibility that everything just stalls and plateau where we are at. You know, like our population growth, it should have gone exponentially but it did not. Actually, quite the reverse.
  
  Reply View | 0 replies
- MengerSponge 10 months ago
  
  Democracy is 200? You're off by a full order of magnitude.
  Progress isn't inevitable. It's possible for knowledge to be lost and for civilization to regress.
  
  Reply View | 1 reply
  
  frognumber 10 months ago
  
  Okay. You're right about what I wrote. Let me rephrase what I meant. I was missing the words "the widespread adoption of"
  Athens had a democracy over 2500 years ago. A few Native American tribes had long-lasting democracies. Ukrainian cities were democratically self-governing 500 years ago, and Poland had elected kings.
  Those were isolated examples. This was not a revolution. We also haven't regressed; isolated examples continued throughout history. If you point to a year, you can probably find some democracy somewhere. The only major regression I know in history was around 1000BC. Regressions are rare.
  What changed was a revolution. From just before 1800 to just a little after 1900, virtually every country had a revolution which led to either being some form of democracy, or pretending to be one. Democracy was no longer isolated. We had the creation of a free world covering much of the world's population, and the creation of what was pretending to be a democracy (today, even the Democratic People's Republic of Korea pretends to be a democracy).
  The number of countries which claim to not be a democracy, you can count on your fingers. Iran. Vatican City. Saudi Arabia. UAE. Oman. Eswanti. Did I miss any?
  https://en.wikipedia.org/wiki/List_of_countries_by_system_of...
  
  Reply View | 0 replies
- jodrellblank 10 months ago
  
  > "The revolutions shorten in time, seemingly exponentially."
  The Technological Singularity - https://en.wikipedia.org/wiki/Technological_singularity
  
  Reply View | 0 replies
bsder 10 months ago

> When I get asked if the person in a video is real, I still feel pretty confident to answer
I don't share your confidence in identifying real people anymore.
I often flag as "false-ish" a lot of things from genuinely real people, but who have adopted the behaviors of the TikTok/Insta/YouTube creator. Hell, my beard is grey and even I poked fun at "YouTube Thumbnail Face" back in 2020 in a video talk I gave. AI twigs into these "semi-human" behavioral patterns super fast and super hard.
There is a video floating around with pairs of young ladies with "This is real"/"This is not real" on signs. They could be completely lying about both, and I really can't tell the difference. All of them have behavioral patterns that seems a little "off" but are consistent with the small number of "influencer" videos I have exposure to.

Reply View | 0 replies
apricot 10 months ago

> When I get asked if the person in a video is real, I still feel pretty confident to answer
I don't. I mean, I can identify the bad ones, sure, but how do I know I'm not getting fooled by the good ones?

Reply View | 1 reply
- weinzierl 10 months ago
  
  That is very true, but for now we have a baseline of videos that we either remember or that we remember key details of, like the persons in the video. I'm pretty sure if I watch The Primeagen or Tom Scott today, that they are real. Ask me in year, I might not be so sure anymore.
  
  Reply View | 0 replies
olabyne 10 months ago

I never thought about that. Humans losing their ability to detect AI content from reality ? It's frightening.

Reply View | 25 replies
- BiteCode_dev 10 months ago
  
  It's worse because many humans don't know they are.
  I see a lot of outrage around fake posts already. People want to believe bad things from the other tribes.
  And we are going to feed them with it, endlessly.
  
  Reply View | 3 replies
  
  PhunkyPhil 10 months ago
  
  Did you think the same thing when photoshop came out?
  It's relatively trivial to photoshop misinformation in a really powerful and undetectable way- but I don't see (legitimate) instances of groundbreaking news over a fake photo of the president or a CEO etc doing something nefarious. Why is AI different just because it's audio/video?
  
  Reply View | 2 replies
- jerf 10 months ago
  
  It's even worse than that. Most people have no idea how far CGI has come, and how easily it is wielded even by a couple of dedicated teens on their home computer, let alone people with a vested interest in faking something for some financial reason. People think they know what a "special effect" looks like, and for the most part, people are wrong. They know what CGI being used to create something obviously impossible, like a dinosaur stomping through a city, looks like. They have no idea how easy a lot of stuff is to fake already. AI just adds to what is already there. Heck, to some extent it has caused scammers to overreach, with things like obviously fake Elon Musk videos on YouTube generated from (pure) AI and text-to-speech... when with just a little bit more learning, practice, and amounts of equipment completely reasonable for one person to obtain, they could have done a much better fake of Elon Musk using special effects techniques rather than shoveling text into an AI. The fact that "shoveling text into an AI" may in another few years itself generate immaculate videos is more a bonus than a fundamental change of capability.
  Even what's free & open source in the special effects community is astonishing lately.
  
  Reply View | 10 replies
  
  jhbadger 10 months ago
  
  And you see things like the The Lion King remake or its upcoming prequel being called "live action" because it doesn't look like a cartoon like the original. But they didn't film actual lions running around -- it's all CGI.
  
  Reply View | 0 replies
  
  bee_rider 10 months ago
  
  Plus, movies continue (for some reason) to be made with very bad and obvious CGI, leading people to believe all CGI is easy to spot.
  
  Reply View | 8 replies
- hn_throwaway_99 10 months ago
  
  I mean, it's already apparent to me that a lot of people don't have a basic process in place to detect fact from fiction. And it's definitely not always easy, but when I hear some of the dumbest conspiracy theories known to man actually get traction in our media, political figures, and society at large, I just have to shake my head and laugh to keep from crying. I'm constantly reminded of my favorite saying, "people who believe in conspiracy theories have never been a project manager."
  
  Reply View | 0 replies
- bongodongobob 10 months ago
  
  Oh they definitely are. A lot of people are now calling out real photos as fake. I frequently get into stupid Instagram political arguments and a lot of times they come back with "yeah nice profile with all your AI art haha". It's all real high quality photography. Honestly, I don't think the avg person can tell anymore.
  
  Reply View | 3 replies
  
  ziml77 10 months ago
  
  I've reached a point where even if my first reaction to a photo is to be impressed, I then quickly think "oh but what it this is AI?" and then immediately my excitement for the photo is ruined because it may not actually be a photo at all.
  
  Reply View | 2 replies
- Sharlin 10 months ago
  
  It's worse: they don't even care.
  
  Reply View | 0 replies
- Suppafly 10 months ago
  
  >Humans losing their ability to detect AI content from reality ? It's frightening.
  And it already happened, and no one pushed back while it was happening.
  
  Reply View | 0 replies
- bunderbunder 10 months ago
  
  This video's worth a watch if you want to get a sense of the current state of things. Despite the (deliberately) clickbait title, the video itself is pretty even-handed.
  It's by Language Jones, a YouTube linguist. Title: "The AI Apocalypse is Here"
  https://youtu.be/XeQ-y5QFdB4
  
  Reply View | 0 replies
- wraptile 10 months ago
  
  I find issue with this statement as content was never a clean representation of human actions or even thought. It was always driven by editorials, SEO, bot remixing and whatnot that heavily influences how we produce content. One might even argue that heightened content distrust is _good_ for our society.
  
  Reply View | 0 replies
- BeFlatXIII 10 months ago
  
  It's a defense lawyer's dream.
  
  Reply View | 0 replies

dweinus 10 months ago

> Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing.

Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.

Reply View 8 replies

[removed] 10 months ago

[deleted]

Reply View | 0 replies
andrethegiant 10 months ago

That sentence stood out to me too, very powerful. Felt it right in the feels.

Reply View | 0 replies
FrustratedMonky 10 months ago

[flagged]

Reply View | 5 replies
- commodoreboxer 10 months ago
  
  The problem is that for the vast majority of use, LLM output is not revised or edited, and very many times I'm convinced the output wasn't even fully read.
  
  Reply View | 4 replies
  
  robrtsql 10 months ago
  
  I assume FrustratedMonky's comment was satirical, given that it appears to have been written like an LLM and starts with a "but, but, but" which is how you might represent someone you disagree with presenting their argument.
  
  Reply View | 3 replies

dsign 10 months ago

Somehow related, paper books from before 2020 could be a valuable commodity in a in a decade or two, when the Internet will be full of slop and even contemporary paper books will be treated with suspicion. And there will be human talking heads posing as the authors of books written by very smart AIs. God, why are we doing this????

Reply View 5 replies

rvnx 10 months ago

To support well-known “philanthropists” like Sam Altman or Mark Zuckerberg that many consider as their heroes here.

Reply View | 0 replies
user432678 10 months ago

And I thought I had some kind of mental illness collecting all those books, barely reading them. Need to do that more now.

Reply View | 1 reply
- globular-toast 10 months ago
  
  Yes. I've always loved my books but now consider them my most valuable possessions.
  
  Reply View | 0 replies
RomanAlexander 10 months ago

Or AI talking heads posing as the author of books written by AIs. https://youtu.be/pAPGRGTqIgI (warning: state sponsored disinformation AI)

Reply View | 0 replies
[removed] 10 months ago

[deleted]

Reply View | 0 replies

aryonoco 10 months ago

I feel so conflicted about this.

On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher. Just cause, there's a lot less crap on gopher (and no, gopher is not the answer).

But...

A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.

The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).

Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.

Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance, which apparently brought a massive smile to his face.

Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.

Reply View 3 replies

sangnoir 10 months ago

> Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance
Didn't most DVD burning software include video transcoding as a standard feature? Back in the day, you'd have used Nero Burning ROM, or Handbrake - granted, the quality may not have been optimized to your standards, but the result would have been a watchable video (especially to 97 year-old eyes)

Reply View | 2 replies
- aryonoco 10 months ago
  
  Back in the day they did. I checked handbrake but now there's nothing specific about DVD compatibility there. I could have picked something like Super HQ 576p, and there's a good chance that would have sufficed, but old DVD players were extremely finicky about filenames, extensions, interlacing, etc. I didn't want to risk the DVD traveling half way across the world only to find that it's not playable.
  
  Reply View | 1 reply
  
  sangnoir 10 months ago
  
  I mentioned Handbrake without checking its DVD authoring capability - probably used it to rip DVDs many years ago and got it mixed up with burning them; a better FLOSS alternative for authoring would have been DeVeDe or bombono.
  
  Reply View | 0 replies

aucisson_masque 10 months ago

Did we (the humans) somehow managed to pollute the internet so much with AI that's it's now barely usable ?

In my opinion the internet can be considered as the equivalent of a natural environment like the earth. it's a space where people share, meet, talk, etc.

I find it astonishing that after polluting our natural environment we know polluted the internet.

Reply View 18 replies

nkozyra 10 months ago

> Did we (the humans) somehow managed to pollute the internet so much with AI that's it's now barely usable
If we haven't already, we will be very soon. I'm sure there are people working on this problem, but I think we're starting to hit a very imminent feedback loop moment. Most of human's recorded information is digitized and most of that is generating non-human content at an incredible pace. We've injected a whole lot of noise into our usable data.
I don't know if the answer is more human content (I'm doing my part!) or novel generative content but this interim period is going to cause some medium-term challenges.
I like to think the LLM more-tokens-equals-better era is fading and we're getting into better use of existing data, but there's a very real inflection point we're facing.

Reply View | 0 replies
coldpie 10 months ago

There are smaller, gated communities that are still very valuable. You're posting in one. But yes, the open Internet is basically useless now, thanks ultimately to advertising as a business model.

Reply View | 9 replies
- nicholassmith 10 months ago
  
  I've seen plenty of comments here that read like they've been generated by an LLM, if this is a gated community we need a better gate.
  
  Reply View | 2 replies
  
  coldpie 10 months ago
  
  Sure, there's bad actors everywhere, but there's really no incentive to do it here so I don't think it's a problem in the same way it is on the open internet, where slop is actively rewarded.
  
  Reply View | 0 replies
  
  globular-toast 10 months ago
  
  It's hard to tell, though. People have been saying my borderline autistic comments sound like GPT for years now.
  
  Reply View | 0 replies
- lobsterthief 10 months ago
  
  Also our collective unwillingness to pay for subscriptions for publications
  
  Reply View | 1 reply
  
  Drew_ 10 months ago
  
  Publications need to charge a-la-carte instead of force feeding subscriptions.
  
  Reply View | 0 replies
- whimsicalism 10 months ago
  
  this is not a gated community at all
  
  Reply View | 3 replies
  
  coldpie 10 months ago
  
  True, that is maybe too strong a phrase, but I think it's close to accurate. I think the culture & medium provide kind of a self-selecting gate: it's just plain text and links to articles, with the discussion expected by culture to be fairly serious. I think that turns off enough people that it kind of forms its own gate shutting out the people that make "eternal Septembers" happen. But yeah, ultimately, you're right.
  
  Reply View | 2 replies
thwarted 10 months ago

Tragedy of the Commons Ruins Everything Around Me

Reply View | 0 replies
surfingdino 10 months ago

Yes. Here are practical instructions on how to turn it into an even more of a cesspit https://www.youtube.com/watch?v=endHz0jo9Ck I think it's now a law of nature that any new tech leads to SEO amplification. AI has become the Degelman M34 Manure Spreader of the internet https://degelman.com/products/manure-spreaders

Reply View | 0 replies
ashton314 10 months ago

That's a nice analogy. Fortunately (un)real estate is easier to manufacture out of thin air online. We have lost some valuable spaces like Twitter and Reddit to some degree though.

Reply View | 0 replies
egypturnash 10 months ago

The public Internet has been relentlessly strip-mined for profit by ever since Canter & Siegel posted their immigration services ad to every single Usenet newsgroup.

Reply View | 0 replies
mathnmusic 10 months ago

> Did we (the humans) somehow managed to pollute the internet
Corporations did that, not humans.
"few people recognize that we already share our world with artificial creatures that participate as intelligent agents in our society: corporations" - https://arxiv.org/abs/1204.4116

Reply View | 0 replies
[removed] 10 months ago

[deleted]

Reply View | 0 replies
left-struck 10 months ago

>We the humans
Nice try
If it’s not clear, I’m joking.

Reply View | 0 replies

baq 10 months ago

All those writers who'll soon be out of job and/or already are and basically unhireable for their previous tasks should be paid for by the AI hyperscalers to write anything at all on one condition: not a single sentence in their works should be created with AI.

(I initially wanted to say 'paid for by the government' but that'd be socialising losses and we've had quite enough of that in the past.)

Reply View 13 replies

vidarh 10 months ago

There are already several companies doing this - I do occasional contract work for a couple -, and paying rates sometimes well above what an average earning writer can expect elsewhere. However, the vast majority of writers have never been able to make a living from their writing. The threshold to write is too love, too many people love it, and most people read very little.

Reply View | 4 replies
- baq 10 months ago
  
  Transformers read a lot during training, it might actually be beneficial for the companies to the point those works never see the light of day, only machines would read them. That's so dystopian I'd say those works should be published so they eventually get into the public domain.
  
  Reply View | 3 replies
  
  ckemere 10 months ago
  
  Rooms full of people writing into a computer is a striking mental picture. It feels like it could be background for a great plot for a book/movie.
  
  Reply View | 2 replies
tveita 10 months ago

Who programs the tapes? https://en.wikipedia.org/wiki/Profession_(novella)

Reply View | 1 reply
- jfultz 10 months ago
  
  _Thank you_. I read this story probably around 1980 (I think in a magazine that was subsequently trashed or garage-saled), and I have spent my adult life remembering the bones of the story, but not the author or the title.
  
  Reply View | 0 replies
bondarchuk 10 months ago

AI companies are indeed hiring such people to generate customized training data for them.

Reply View | 3 replies
- neilv 10 months ago
  
  Is it the same companies that simply took all the writers' previous work (hoping to be billionaires before the courts understand)?
  
  Reply View | 1 reply
  
  shadowgovt 10 months ago
  
  Yes. This was always the failure with the argument that copyright was the relevant issue... Once the model was proven out, we knew some wealthy companies would hire humans to generate the training data that the companies could then own in whole, at the relative expense of all other humans that didn't get paid to feed the machines.
  
  Reply View | 0 replies
- passion__desire 10 months ago
  
  This idea could also be extended to domains like Art. Create new art styles for AI to learn from. But in future, that will also get automated. AI itself will create art styles and all humans would do is choose whether something is Hot or Not. Sort of like art breeder.
  
  Reply View | 0 replies
nkozyra 10 months ago

People have been paid to generate noise for a decade+ now. Garbage in, garbage out will always be true.
Next token-seeking is a solved problem. Novel thinking can be solved by humans and possibly by AI soon, but adding more garbage to the data won't improve things.

Reply View | 0 replies
trilbyglens 10 months ago

Have you ever read american history? Lol.

Reply View | 0 replies