terminalshort 10 hours ago

How real is this "crawler plague" that the author refers to? I haven't seen it. But that's just as likely to be because I don't care, and therefore am not looking, as it is to be because it's not there. Loading static pages from CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?

ApeWithCompiler 9 hours ago

The following is the best I could collect quickly to back up the statement. Unfortunately it's not the high-quality, first-hand raw statistics I would have liked.

But from what I have read from time to time, the crawlers acted orders of magnitude outside of what could be excused as just being badly configured.

https://herman.bearblog.dev/the-great-scrape/

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

https://lwn.net/Articles/1008897/

https://tecnobits.com/en/AI-crawlers-on-Wikipedia-platform-d...

https://boston.conman.org/2025/08/21.1

hombre_fatal 8 hours ago

My forum traffic went up 10x due to bots a few months ago. Never seen anything like it.

> Loading static pages from CDN to scrape training data takes such minimal amounts of resources that it's never going to be a significant part of my costs. Are there cases where this isn't true?

Why did you bring up static pages served by a CDN, the absolute best case scenario, as your reference for how crawler spam might affect server performance?

  • senko 7 hours ago

    Not OP, but many technologies nowadays push users to use a server-side component when not needed.

    An example is NextJS where you're strongly encouraged[0] to run a server (or use a platform like Vercel), even if what you're doing is a fairly simple static site.

    Combine an inconsiderate crawler (AI or otherwise) with server-side logic that doesn't really need to be there and you have a recipe for a crash, a big hosting bill, or both.

    [0] People see https://nextjs.org/docs/app/guides/static-exports#unsupporte... and go "ah shucks I better have a server component then"
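
    To be fair, you can opt out of running a server entirely. A minimal sketch of the config, going off those same docs (and assuming the app avoids the unsupported features on that page):

      // next.config.js: with `output: 'export'`, `next build` writes plain
      // HTML/CSS/JS to ./out, which any static host or CDN can serve with no
      // Node server behind it.
      /** @type {import('next').NextConfig} */
      module.exports = {
        output: 'export',
      };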

  • dehrmann 5 hours ago

    > My forum traffic...

    > Why did you bring up static pages served by a CDN...

    This is easier said than done, but pushing the latest topic snapshot to the CDN whenever a post is made is doable.
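
    Something like this on the write path is the idea (all names are made up; it's just a sketch):

      // Push a fresh HTML snapshot of the topic to the CDN whenever a post is
      // made, so crawlers only ever hit the edge instead of the database.
      interface Cdn {
        put(path: string, body: string): Promise<void>;
        purge(path: string): Promise<void>;
      }

      async function onPostCreated(
        cdn: Cdn,
        topicId: string,
        renderTopic: (id: string) => Promise<string>, // hypothetical renderer
      ): Promise<void> {
        const html = await renderTopic(topicId); // render once, at write time
        await cdn.put(`/t/${topicId}`, html);    // push the snapshot to edge storage
        await cdn.purge(`/t/${topicId}`);        // drop any stale cached copy
      }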

n3storm 10 hours ago

My estimation is that at least 70% of the traffic on small sites (300-3000 daily views) is not human.

snowwrestler 10 hours ago

Yes, it’s true. Most sites don’t have a forever cache TTL so a crawler that hits every page on a database-backed site is going to hit mostly uncached pages (and therefore the DB).

I also have a faceted search that some stupid crawler has spent the last month iterating through. Also mostly uncached URLs.
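
And for the pages that can live in the cache forever, it is just one header. A sketch with Node's built-in http module, purely to illustrate; substitute whatever your stack actually is:

  // Serve with a long Cache-Control so the CDN absorbs repeat crawler hits
  // instead of every request falling through to the app and the database.
  import { createServer } from "node:http";

  createServer((req, res) => {
    res.setHeader("Cache-Control", "public, max-age=31536000, immutable");
    res.end("<html>...</html>");
  }).listen(3000);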

  • n3storm 10 hours ago

    Yeah, or an events plugin where spiders walk through every day of several years...

danaris 9 hours ago

It's very real. It's crashed my site a number of times.

zzzeek 8 hours ago

I just had to purchase a Cloudflare account to protect two of my sites used for CI, which run Jenkins and Gerrit servers. These are resource-hungry Java VMs that I run on a minimally powered server, as they are intended to be accessed only by a few people. Yet crawlers located in eastern Europe and Asia eventually found them and would regularly drive my CPU up to 500% and make the server unavailable (it should go without saying that I have always had a robots.txt on these sites that prohibits all crawling; such files are a quaint relic of a simpler time).

For a couple of years I'd block the various offending IPs, but this past month the crawling resumed, this time intentionally swarmed across hundreds of IP addresses so that I could not easily block them. Cloudflare was able to show me within minutes that the entirety of the IP addresses came from a single ASN owned by a very large and well known Chinese company, and I blocked the entire ASN.

While I could figure out these ASNs manually and get blocklists to add to my Apache config, Cloudflare makes it super easy, showing you the whole thing happening in realtime. You can even tailor the 403 response to send them a custom message, in my case: "ALL of the data you are crawling is on github! Get off these servers and go get it there!" (again, sure, I could write out httpd config for all of that, but who wants to bother). They are definitely providing a really critical service.
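
For the curious, the httpd-only version of that block would look roughly like this (an untested sketch; the CIDR ranges are placeholders, not the real blocklist):

  # Apache 2.4: refuse the crawler's address blocks and serve a custom 403 body.
  <Location "/">
      <RequireAll>
          Require all granted
          Require not ip 203.0.113.0/24
          Require not ip 198.51.100.0/24
      </RequireAll>
  </Location>
  ErrorDocument 403 "ALL of the data you are crawling is on github! Go get it there!"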

  • SoftTalker 6 hours ago

    > intended to be accessed only by a few people

    So why are they open to the entire world?

    • zzzeek 6 hours ago

      They're open to people who contribute PRs, so they can see why their tests failed. Also, htdigest / htpasswd access is complicated or impossible (depending on the use case) to configure with the way Jenkins / Gerrit authentication itself works, particularly with internal scripts and hooks that need to communicate with them.

  • cm2187 8 hours ago

    Particularly if your users are keen on solving recaptchas over and over.

    • cobbzilla 7 hours ago

      How many users do you think are on the poster's Jenkins/CI system? It sounded like a personal thing or maybe a small team; I didn't get the impression it was supposed to be public.

      • cm2187 7 hours ago

        The poster ends with a general comment on the usefulness of Cloudflare.

      • zzzeek 3 hours ago

        It's an open source project. It's public.

    • zzzeek 3 hours ago

      I don't even have the captchas turned on. When I get an email that the CPU has been churning for three hours, Cloudflare gives me a quick way to see where the traffic is coming from and I can just block it. Because it's always crawlers, which is the point of this discussion: "are there actually crawlers?" Yes.