Comment by derefr

> those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet

The problem is that, of the signals you mention,

• the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")

• and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.

Why? One of the primary causes is TikTok, because TikTok content gets cross-posted to YouTube a lot. TikTok has a built-in voiceover tool, and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else. So they sit there typing out a script on their phone, and then have the AI read it, rather than reading it themselves.

And then, when these videos get cross-posted, it's usually as part of some kind of compilation, put together by some tool that picks an AI-generated thumbnail for the compilation.

Yet all the content in these videos is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus, because people of that age don't do much writing in public-web-scrapeable places. So Google really wants to sample it.)
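
To make the two kinds of signal concrete, here's a rough sketch of what I mean. Everything in it is made up for illustration: the `ChannelContext` and `Video` types, the field names, and the 6-uploads-per-hour threshold (i.e. one every 10 minutes) are assumptions of mine, not anything a real pipeline is known to use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelContext:
    """Hypothetical site-specific metadata. A crawler only has this
    if it understands the particular site the video came from."""
    uploads_per_hour: float
    has_affiliate_links: bool


@dataclass
class Video:
    """Hypothetical record for one scraped video."""
    synthetic_voice: bool   # context-free: detectable from the audio alone
    ai_thumbnail: bool      # context-free: detectable from the image alone
    context: Optional[ChannelContext] = None


def contextual_verdict(video: Video) -> Optional[bool]:
    """Spam verdict from site-specific signals.

    Returns None when no site-specific context is available, which is
    the scaling problem: this check can't run on arbitrary sites.
    """
    if video.context is None:
        return None
    # Illustrative threshold: 6 uploads/hour is one every 10 minutes.
    return video.context.uploads_per_hour >= 6 and video.context.has_affiliate_links


def context_free_verdict(video: Video) -> bool:
    """Spam verdict from signals that need no site knowledge."""
    return video.synthetic_voice and video.ai_thumbnail


# A review farm, a compilation of human-written TikToks, and a video
# from a site the crawler knows nothing about.
review_farm = Video(True, True, ChannelContext(6.0, True))
tiktok_compilation = Video(True, True, ChannelContext(0.1, False))
unknown_site = Video(True, True)
```

In this toy setup, `context_free_verdict` flags the review farm and the TikTok compilation alike, so it throws away the human-written scripts. `contextual_verdict` tells them apart, but has nothing to say about `unknown_site`.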