Comment by Hizonner
> The difference between that and the LLM training data scraping
Is the traffic that people are complaining about really training traffic?
My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
That doesn't seem like enough traffic to be a really big problem.
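For what it's worth, the back-of-envelope estimate above can be sketched in a few lines. This is only a rough illustration: the figure of 24 models per year is a placeholder standing in for "dozens", not a measured number, and the function name is made up for this example. The 10 crawls per model is the deliberately pessimistic assumption from the estimate itself.

```python
def full_downloads_per_site_per_year(models_per_year: int, crawls_per_model: int) -> int:
    """Worst-case full-content downloads of one site per year from training crawls."""
    return models_per_year * crawls_per_model


# Assumption: "dozens" of foundation models per year, taken here as 24 (placeholder).
MODELS_PER_YEAR = 24
# Worst case from the estimate: no caching, every site crawled 10 times per model.
CRAWLS_PER_MODEL = 10

per_year = full_downloads_per_site_per_year(MODELS_PER_YEAR, CRAWLS_PER_MODEL)
per_day = per_year / 365

print(f"Full-site downloads per year: {per_year}")
print(f"Full-site downloads per day:  {per_day:.2f}")
```

Under those assumptions that comes to a couple of hundred full downloads per site per year, which is less than one per day.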
On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.
Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.
So what's really going on here? Anybody actually know?
The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.
There's some user-directed traffic, but it's a small fraction, in my experience.