Comment by terminalshort 10 hours ago
How real is this "crawler plague" that the author refers to? I haven't seen it. But that's just as likely to be because I don't care, and therefore am not looking, as it is to be because it's not there. Loading static pages from a CDN to scrape training data takes such minimal resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?
The following is the best I could collect quickly to back up the statement. Unfortunately, it's not the high-quality, first-hand raw statistics I would have liked.
But from what I have read from time to time, the crawlers acted orders of magnitude outside of what could be accepted as just badly configured.
https://herman.bearblog.dev/the-great-scrape/
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
https://lwn.net/Articles/1008897/
https://tecnobits.com/en/AI-crawlers-on-Wikipedia-platform-d...
https://boston.conman.org/2025/08/21.1