How I block all 26M of your curl requests
(foxmoss.com)
100 points by foxmoss 3 days ago
> Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?
That doesn't matter, does it? Those 26 million requests could be going to actual users instead and 300 requests per second is non-trivial if the requests require backend activity. Before you know it you're spending most of your infra money on keeping other people's bots alive.
Blocking 26M bot requests doesn't mean 26M legitimate requests magically appear to take their place. The concern is that you're spending infrastructure resources serving requests that provide zero business value. Whether that matters depends on what those requests actually cost you. As the original commenter pointed out, this is likely not very much at all.
There are also HTTP/2 fingerprints. I believe the common one is named after Akamai or something.
All of it is fairly easy to fake. JavaScript is the only thing that poses any real challenge, and even that challenge is mostly about how to handle it with minimal performance impact. The simple truth is that a motivated adversary can interrogate and match every single minor behavior of the browser to be bit-perfect, and there is nothing anyone can do about it, except for TPM attestations, which also require a fully jailed OS environment in order to control the data flow to the TPM.
Even the attestation pathway can probably be defeated, either through the mandated(?) accessibility controls or by going for more extreme measures, and then putting the devices to work in a farm.
This is exactly right, and it's why I believe we need to solve this problem in the human domain, with laws and accountability. We need new copyrights that cover serving content on the web and give authors control over who gets to access that content, WITHOUT requiring locked down operating systems or browser monopolies.
> with tools like Anubis being largely ineffective
To the contrary - if someone "bypasses" Anubis by setting the user agent to Googlebot (or curl), it means it's effective. Every Anubis installation I've been involved with so far explicitly allowed curl. If you think it's counterproductive, you probably just don't understand why it's there in the first place.
The problem you usually attempt to alleviate by using Anubis is that you get hit by load generated by aggressive AI scrapers that are otherwise indistinguishable from real users. As soon as the bot is polite enough to identify as some kind of a bot, the problem's gone, as you can now apply your regular measures for rate limiting and access control.
(yes, there are also people who use it as an anti-AI statement, but that's not the reason why it's used on the most high-profile installations out there)
Good news for curl users: https://github.com/mandatoryprogrammer/thermoptic
> NOTE: Due to many WAFs employing JavaScript-level fingerprinting of web browsers, thermoptic also exposes hooks to utilize the browser for key steps of the scraping process. See this section for more information on this.
This reminds me of how Stripe does user tracking for fraud detection (https://mtlynch.io/stripe-update/). I wonder if thermoptic could handle that.
Blocking on ja3/ja4 signals to folks exactly what you are up to. This is why bad actors doing ja3 randomization became a thing in the last few years and made ja3 matching useless.
Imo use ja3/ja4 as a signal and block on src IP. Don't show your cards. Ja4 extensions that compare network vs http/tls latency are also pretty elite for identifying folks who are proxying.
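To illustrate the "use it as a signal, block on the IP" idea, here is a minimal sketch; the struct, weights, and threshold are hypothetical, and the signals are only examples. The point is that the client is blocked by source IP once a combined score trips, so it never learns which individual signal gave it away.

    /* Hypothetical scoring sketch: the JA3/JA4 mismatch is one signal among
       several; block the source IP once the score crosses a threshold. */
    #include <stdbool.h>
    #include <stdint.h>

    struct client_score {
        uint32_t src_ip;   /* IPv4 source address, host byte order */
        int score;
    };

    static bool should_block(struct client_score *c,
                             bool ja3_suspicious,    /* fingerprint doesn't match the claimed client */
                             bool latency_mismatch,  /* network RTT vs TLS/HTTP timing disagree */
                             bool rate_exceeded)
    {
        c->score += ja3_suspicious   ? 2 : 0;
        c->score += latency_mismatch ? 2 : 0;
        c->score += rate_exceeded    ? 1 : 0;
        return c->score >= 4;   /* block the IP, not the fingerprint */
    }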
It is a cute technique, but I would prefer if the fingerprint were used higher up in the stack. The fingerprint should be compared against the User-Agent. I'm more interested in blocking curl when it is specifically reporting itself as Chrome/x.y.z.
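A rough sketch of that idea, assuming you already have some classification of the ClientHello; the family string and function name here are made up for illustration. The block only fires when the TLS side and the claimed User-Agent disagree.

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical check: only flag a curl-style ClientHello that reports
       itself as Chrome in the User-Agent header. */
    static bool ua_fingerprint_mismatch(const char *hello_family,  /* e.g. "curl", "chrome" */
                                        const char *user_agent)
    {
        bool claims_chrome = strstr(user_agent, "Chrome/") != NULL;
        bool hello_is_curl = strcmp(hello_family, "curl") == 0;
        return claims_chrome && hello_is_curl;
    }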
Most of the abusive scraping is much lower hanging fruit. It is easy to identify the bots and relate that back to ASNs. You can then block all of Huawei cloud and the other usual suspects. Many networks aren't worth allowing at this point.
For the rest, the standard advice about performant sites applies.
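A rough sketch of the ASN blocking mentioned above, assuming the offending ASNs have already been resolved to a list of prefixes; the entry below is a placeholder, not a real Huawei Cloud range.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct cidr { uint32_t net; uint32_t mask; };  /* host byte order */

    /* Placeholder prefix; real values would come from an ASN-to-prefix lookup. */
    static const struct cidr blocked[] = {
        { 0xC0A80000u, 0xFFFF0000u },  /* 192.168.0.0/16, illustration only */
    };

    static bool src_ip_blocked(uint32_t src_ip)   /* host byte order */
    {
        for (size_t i = 0; i < sizeof(blocked) / sizeof(blocked[0]); i++)
            if ((src_ip & blocked[i].mask) == blocked[i].net)
                return true;
        return false;
    }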
A lot of the bots are compromised servers (eg hacked Wordpress sites), with limited control over what the TLS fingerprints look like.
There are plenty of naive bots. That is why tar pits work so great at trapping them in. And this TLS based detection looks just like offline/broken site to bots, it will be harder to spot unless you are trying to scrap only that one single site.
I got exactly this far:
    uint8_t *data = (void *)(long)ctx->data;
before I stopped reading. I had to go look up the struct xdp_md [1]; it is declared like this:

    struct xdp_md {
        __u32 data;
        __u32 data_end;
        __u32 data_meta;
        /* ... further fields elided ... */
    };
So clearly the `data` member is already an integer. The sane way to cast it would be to cast to the actual desired destination type, rather than first to some other random integer and then to a `void` pointer. Like so:
    uint8_t * const data = (uint8_t *) ctx->data;
I added the `const` since the pointer value is not supposed to change once we get it from the incoming structure. Note that this `const` does not mean we can't write through `data` if we feel like it; it means the base pointer itself can't change, i.e. we can't "re-point" the pointer. This is often a nice property, of course.

[1]: https://elixir.bootlin.com/linux/v6.17/source/include/uapi/l...
Your code emits a compiler warning about casting an integer to a pointer. Changing the cast to void* emits a slightly different warning, about the integer being cast being smaller than the pointer type. Casting to a long and then to a void* avoids both of these warnings.
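For a concrete picture, here is a minimal sketch of the three casts in question, using a hypothetical stand-in for struct xdp_md on a 64-bit target; the exact warning text depends on the compiler.

    #include <stdint.h>

    struct xdp_md_like { uint32_t data; };  /* stand-in, not the real struct xdp_md */

    void cast_examples(struct xdp_md_like *ctx)
    {
        /* warns: cast to pointer from integer of different size */
        uint8_t *a = (uint8_t *)ctx->data;

        /* also warns: the integer is narrower than the pointer type */
        uint8_t *b = (void *)ctx->data;

        /* widen to a pointer-sized integer first, then cast: no warning
           (this is the form used in the article's code) */
        uint8_t *c = (void *)(long)ctx->data;

        (void)a; (void)b; (void)c;
    }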
Sorry, all that stuff might be true but this whole process is nuts.
The code segment containing that code looks like a no-op.
The rest of the post seems sane and well informed, so my theory is that this is a C / packet filtering idiom I’m not aware of, working far from that field.
Otherwise I’m already freaked out by treating a 32 bit field as a pointer… even if you extend it to 64 bits first.
> Otherwise I’m already freaked out by treating a 32 bit field as a pointer… even if you extend it to 64 bits first.
The cast from a 32 bit field to a 64 bit pointer is in fact an eBPF oddity. So what's happening here is that the virtual machine is giving us a fake memory address to use in the program, and when the read actually needs to happen the kernel rewrites the virtual addresses to the real ones. I'm assuming this is just a byproduct of the memory separation that eBPF does to prevent filters from accidentally reading kernel memory.
Also yes the double cast is just to keep the compiler from throwing a warning.
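For context, that cast shows up in essentially every XDP program; a minimal sketch of the usual idiom follows (program and function names here are illustrative). The verifier insists that every packet access is bounds-checked against data_end before it is dereferenced.

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_example(struct xdp_md *ctx)
    {
        /* The standard double cast: widen the 32-bit field, then treat it
           as a pointer. The verifier tracks these as packet pointers. */
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;  /* too short; refuse to read past the end */

        /* ... parse IP/TCP and, eventually, the TLS ClientHello here ... */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";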
Do you actually use this?
MD5 (How I Block All 26 Million Of Your Curl Requests.html) = e114898baa410d15f0ff7f9f85cbcd9d (downloaded with Safari)
I'm aware of curl-impersonate (https://github.com/lwthiker/curl-impersonate), which works around these kinds of things (and makes working with Cloudflare much nicer), but serious scrapers use Chrome plus a USB keyboard/mouse gadget that you can ssh into, so there's literally no evidence of mechanical means.

Also: If you serve some Anubis code without actually running the Anubis script in the page, you'll get some answers back, so there's at least one Anubis simulator running on the Internet that doesn't bother to actually run the JavaScript it's given.
Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?