Comment by kevindamm 10 months ago

Yes, but not quite as far as you imply. The training data is weighted by a quality metric: articles written by journalists and Wikipedia contributors are given more weight than Aunt May's brownie recipe and corpoblogspam.

jsheard 10 months ago

> The training data is weighted by a quality metric

At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation since the point is to get the viewer to click an affiliate link.

Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?


acdha 10 months ago

> Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?
Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue and those sites were loaded with ads. As long as advertisers didn’t stop buying, they didn’t feel much pressure to make big changes.
A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.

derefr 10 months ago

> those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet
The problem is that, of the signals you mention,
• the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")
• and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.
Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.
And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.
Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)

nneonneo 10 months ago

Reminds me of a Google search I did yesterday: “Hezbollah” yields a little info box with headings “Overview”, “History”, “Apps” and “Return policy”.
I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two tabs, but who knows. Maybe some AI video out there did a product review of Hezbollah.

- selestify 10 months ago
  
  Wow, you’re not kidding. The “return policy” info box officially links to https://www.reuters.com/world/middle-east/dozens-hezbollah-m...
  
Suppafly 10 months ago

> At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight.
I've noticed that lately. It used to be that the top Google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.

epgui 10 months ago

I don’t think they were talking about the quality of Google search results. I believe they were talking about how the data was processed by the wordfreq project.

- kevindamm 10 months ago
  
  I was actually referring to the data ingestion for training LLMs; I don't know what filtering or weighting might be done with wordfreq.
  
noirscape 10 months ago

Google has those problems because the company's revenue source (Ads) and the thing that puts it on the map (Search) are fundamentally at odds with one another.
A useful Search would ideally send a user to the site with the most signal and the least noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate to the subject of a page.
Up until ~5 years ago, Google was able to strike a balance between these two; you'd get results with some Ads but the signal generally outweighed the noise. Unfortunately, from what I can tell from anecdotes and courtroom documents, sometime in 2018-2019 the Ad team at Google essentially hijacked every other aspect of the company by threatening that yearly bonuses wouldn't be given out if the other teams didn't kowtow to the Ad team's wishes to optimize ad revenue, and it shows no sign of stopping since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing though and Kagi is only used by tech enthusiasts. The problem with Google is that to copy it, you need a ton of computing resources upfront and are going up against a company with infinitely more money and ability to ensure users don't leave their ecosystem; go ahead and abandon Search, but good luck convincing others to give up, say, their Gmail account, which keeps them locked to Google, and Search will be there, enticing the average user.)
Google has absolutely zero incentive to filter out generative AI junk from their search results beyond the amount of it that's damaging their PR, since most of the SEO spam is also running Google Ads (since unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to instead reduce it to the degree where a user will not get the same type of AI junk twice.

- PaulHoule 10 months ago
  
  My understanding is that Google Ads are what makes Google Search unassailable.
  A search engine isn't a two-sided market in itself but the ad network that supports it is. A better search engine is a technological problem, but a decently paying ad network is a technological problem and a hard marketing problem.
  
inquirerGeneral 10 months ago

[dead]


Freak_NL 10 months ago

It certainly feels like the amount of regurgitated, nonsensical, generated content (nontent?) has risen spectacularly specifically in the past few years. 2021 sounds about right based on just my own experience, even though I can't point to any objective source backing that up.


eszed 10 months ago

Upvoted for "nontent" alone: it'll be my go-to term from now on, and I hope it catches on.
Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.

- Freak_NL 10 months ago
  
  I do hope it catches on! I did come up with this myself, but I really doubt I'm the only one — and indeed: Wiktionary lists it already with a 2023 vintage:
  https://en.wiktionary.org/wiki/nontent
  
zharknado 10 months ago

Ooh I like “nontent.” Nothing like a spicy portmanteau!

eptcyka 10 months ago

I personally have yet to see this beyond some slop on YouTube. And I am here for the AI meme videos. I recognize the dangers of this; all I am saying is that I don't feel the effect yet.

- Freak_NL 10 months ago
  
  I'm seeing it a lot when searching for some advice in a well-defined subject, like, say, leatherworking or sewing (or recipes, obviously). Instead of finding forums with hobbyists, in-depth blog posts, or manufacturers' advice pages, increasingly I find articles which seem like natural language at first, but are composed of paragraphs and headers repeating platitudes and basic tips. It takes a few seconds to realize the site is just pushing generated articles.
  Increasingly I find that for in-depth explanations or tutorials YouTube is the only place to go, but even there the search results can lead to loads of videos which just seem… off. But at least those are still made by humans.
  
- ghaff 10 months ago
  
  There's been a ton of low-rent listicle writing out there for ages. Certainly not new in the past few years. I admit I don't go on YouTube much and don't even have a TikTok account, so it's possible there's a lot of newer lousy content I'm not really exposed to.
  It seems to me that the fact it's so cheap and relatively easy for people with dreams of becoming wealthy influencers to put stuff out there has more to do with the flood of often mediocre content than AI does.
  Of course the vast majority don't have much real success and get on with life and the crank turns and a new generation perpetuates the cycle.
  LLMs etc. may make things marginally easier but there's no shortage of twenty somethings with lots of time imagining riches while making pennies.
  
- sharpshadow 10 months ago
  
  Looking forward to watching perfectly generated videos. We need so much more power and chips, but it’s completely worth it. After that? Maybe generated videogames. But the video stuff will be awesome and will change video-dominated social media content forever. Virtual headsets will finally become useful, generating anything you want to see and letting you jump through space and time.
  
jsheard 10 months ago

SEO grifters have fully integrated AI at this point, there are dozens of turn-key "solutions" for mass-producing "content" with the absolute minimum effort possible. It's been refined to the point that scraping material from other sites, running it through the LLM blender to make it look original, and publishing it on a platform like Wordpress is fully automated end-to-end.

- sahmeepee 10 months ago
  
  Or check out "money printer" on GitHub: a tongue-in-cheek mashup of various tools to take a keyword as input and produce a YouTube video with subtitles and narration as output.
  

darby_nine 10 months ago

Aunt May's brownie recipe (or at least her thoughts on it) is likely something you'd want if you want to reflect how humans use language. Both news-style and encyclopedia-style writing represent a pretty narrow slice.


creshal 10 months ago

That's why search engines rated them highly, and why a million spam sites cropped up that paid writers $1/essay to pretend to be Aunt May, and why today every recipe website has a gigantic useless fake essay in front of its copy-pasted, made-up recipes.

- Freak_NL 10 months ago
  
  I hate how looking for recipes has become so… disheartening. Online recipes are fine for reputable sources like newspapers where professional recipe writers are paid for their contributions, but searching for some Aunt May's recipe for 'X' in the big ocean of the internet is pointless — too much raw sewage dumped in.
  It sucks, because sharing recipes seemed like one of those things the internet could be really good at.
  
  
  smallerfish 10 months ago
  
  There seem to be quite a few recipe sharing sites around - e.g. allrecipes.com.
  
  
  c6400sc 10 months ago
  
  It's interesting to search for recipes in other languages and not find junk as we do in English.
  I read Spanish and Italian fluently and stumble my way through Japanese (with translation). It's easier to find a good recipe in these languages, provided you can find the ingredients or substitutes.
  
- shagie 10 months ago
  
  I wish more people presented recipes like Cooking for Engineers does. For example, Meat Lasagna: https://www.cookingforengineers.com/recipe/36/Meat-Lasagna
  
  
  bhasi 10 months ago
  
  I love the table-diagrams at the end. I've never seen anything like that until now and it really seems useful for visualization of the recipe and the sequence of steps.
  
  
  grues-dinner 10 months ago
  
  And here I thought my defacement of printed recipes by bracketing everything that goes together at each stage was just me. There are, well, maybe not dozens but at least two of us! Saves a lot of bowls when you know without further checking that you can, say, just dump the flour and sugar, butter and eggs into the big bowl without having to prepare separately because they're in the "1: big bowl" bracket.
  
- darby_nine 10 months ago
  
  OK, but what I said is true regardless of SEO, and SEO had already fed back into English before LLMs were a thing. If you only train on those subsets you'll also end up with a chatbot that doesn't speak in a way we'll identify as natural English.
  
  
  actionfromafar 10 months ago
  
  Yet. Give it time. The LLMs will train our future children.
  
  
  darby_nine 10 months ago
  
  I'm sure they already are.
  

Lalabadie 10 months ago

The current state of things leads me to believe that Google's current ranking system has been somehow too transparent for the last 2-3 years.

The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.
