Comment by geocar 8 hours ago

Do you actually use this?

    $ md5 How\ I\ Block\ All\ 26\ Million\ Of\ Your\ Curl\ Requests.html
    MD5 (How I Block All 26 Million Of Your Curl Requests.html) = e114898baa410d15f0ff7f9f85cbcd9d

(downloaded with Safari)

    $ curl https://foxmoss.com/blog/packet-filtering/ | md5sum
    e114898baa410d15f0ff7f9f85cbcd9d  -

I'm aware of curl-impersonate https://github.com/lwthiker/curl-impersonate which works around these kinds of things (and makes working with Cloudflare much nicer), but serious scrapers use Chrome plus a USB keyboard/mouse gadget that you can ssh into, so there's literally no evidence of mechanical means.
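
(If you want to try it yourself: curl-impersonate ships per-browser wrapper scripts, so something like the line below should fetch the same page with a Chrome-shaped ClientHello. The exact wrapper name, e.g. curl_chrome116, depends on which release you install.)

    $ curl_chrome116 https://foxmoss.com/blog/packet-filtering/ | md5sum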

Also: if you serve some Anubis challenge code without actually running the Anubis script in the page, you'll still get some answers back, so there's at least one Anubis simulator running on the Internet that doesn't bother to run the JavaScript it's given.
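
(Rough sketch if you want to check your own logs for that: flag clients that POST an answer without ever having fetched the challenge script. Assumes a common/combined-format access log, and /challenge.js and /answer are made-up placeholder paths, not Anubis's real routes, so adjust to whatever your deployment actually serves.)

    $ awk '$7 == "/challenge.js" { fetched[$1] = 1 }
           $6 == "\"POST" && $7 == "/answer" && !fetched[$1] { print $1 }' access.log | sort -u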

Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?
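
Quick check on the arithmetic, averaged over a day (86,400 seconds):

    $ echo '26000000 / 86400' | bc -l
    300.92592592592592592592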

mrb 32 minutes ago

He does use it (I verified it with curl on a recent Linux distro). But he probably blocked only some fingerprints. And the fingerprint depends on the exact OpenSSL and curl versions, as different version combinations will send different TLS ciphers and extensions.
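
(A rough way to see this from the client side: the TLS backend is baked into the curl binary, and you can nudge parts of the ClientHello from the command line. Sketch below; whether it actually gets past his filter depends on what exactly he matches on.)

    $ curl -V | head -1
    $ curl --ciphers ECDHE-RSA-AES128-GCM-SHA256 -o /dev/null -w '%{http_code}\n' \
        https://foxmoss.com/blog/packet-filtering/

The first command just shows which SSL library your curl build links against; the second overrides the TLS 1.2 cipher list, which changes part of the fingerprint but leaves extension order and the rest of the handshake looking like stock curl.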

renegat0x0 2 hours ago

From what I have seen, it is hard to tell what "serious scrapers" use. They use many things: some use this, some don't. That is what I have learned from reading about web scraping on Reddit. Nobody says these things out loud.

There are many tools; see the links below.

Personally I think running Selenium can be a bottleneck, as it does not play nice: processes sometimes break, sometimes the whole system needs a restart because things get blocked, it can be a memory hog, etc. That is my experience.

To be able to scale, I think you have to have your own implementation. Serious scrapers dismiss people using Selenium or its derivatives as noobs who will come back asking why page X does not work with their scraping setup.

https://github.com/lexiforest/curl_cffi

https://github.com/encode/httpx

https://github.com/scrapy/scrapy

https://github.com/apify/crawlee

dancek 6 hours ago

The article talks about 26M requests per second, not per day. It's theoretical, of course.

jacquesm 7 hours ago

> Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?

That doesn't matter, does it? Those 26 million requests could be going to actual users instead, and 300 requests per second is non-trivial if the requests require backend activity. Before you know it you're spending most of your infra money on keeping other people's bots alive.

arcfour 6 hours ago

Blocking 26M bot requests doesn't mean 26M legitimate requests magically appear to take their place. The concern is that you're spending infrastructure resources serving requests that provide zero business value. Whether that matters depends on what those requests actually cost you. As the original commenter pointed out, this is likely not very much at all.