geocar 3 hours ago

Do you actually use this?

    $ md5 How\ I\ Block\ All\ 26\ Million\ Of\ Your\ Curl\ Requests.html
MD5 (How I Block All 26 Million Of Your Curl Requests.html) = e114898baa410d15f0ff7f9f85cbcd9d

(downloaded with Safari)

    $ curl https://foxmoss.com/blog/packet-filtering/ | md5sum
    e114898baa410d15f0ff7f9f85cbcd9d  -
I'm aware of curl-impersonate https://github.com/lwthiker/curl-impersonate, which works around these kinds of checks (and makes working with Cloudflare much nicer), but serious scrapers use Chrome plus a USB keyboard/mouse gadget they can ssh into, so there's literally no evidence of mechanical means.

Also: if you serve the Anubis challenge without actually running the Anubis script in the page, you'll still get some answers back, so there's at least one Anubis simulator running on the Internet that doesn't bother to actually run the JavaScript it's given.

Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?

  • jacquesm 2 hours ago

    > Also also: 26M requests daily is only 300 requests per second and Apache could handle that easily over 15 years ago. Why worry about something as small as that?

    That doesn't matter, does it? Those 26 million requests could be going to actual users instead and 300 requests per second is non-trivial if the requests require backend activity. Before you know it you're spending most of your infra money on keeping other people's bots alive.

    • arcfour an hour ago

      Blocking 26M bot requests doesn't mean 26M legitimate requests magically appear to take their place. The concern is that you're spending infrastructure resources serving requests that provide zero business value. Whether that matters depends on what those requests actually cost you. As the original commenter pointed out, this is likely not very much at all.

  • dancek an hour ago

    The article talks about 26M requests per second. It's theoretical, of course.

coppsilgold 2 hours ago

There are also HTTP fingerprints. I believe it's named after Akamai or something.

All of it is fairly easy to fake. JavaScript is the only thing that poses any challenge, and the challenge it poses is mostly in how to do it with minimal performance impact. The simple truth is that a motivated adversary can interrogate and match every single minor behavior of the browser bit-perfectly, and there is nothing anyone can do about it, except for TPM attestations, which also require a fully jailed OS environment in order to control the data flow to the TPM.

Even the attestation pathway can probably be defeated, either through the mandated(?) accessibility controls or going for more extreme measures. And putting the devices to work in a farm.

  • delusional an hour ago

This is exactly right, and it's why I believe we need to solve this problem in the human domain, with laws and accountability. We need new copyrights that cover serving content on the web and give authors control over who gets to access that content, WITHOUT requiring locked-down operating systems or browser monopolies.

seba_dos1 3 days ago

> with tools like Anubis being largely ineffective

To the contrary - if someone "bypasses" Anubis by setting the user agent to Googlebot (or curl), it means it's effective. Every Anubis installation I've been involved with so far explicitly allowed curl. If you think it's counterproductive, you probably just don't understand why it's there in the first place.

  • jgalt212 2 days ago

    If you're installing Anubis, why are you setting it to allow curl to bypass?

    • seba_dos1 2 days ago

The problem you usually attempt to alleviate by using Anubis is that you get hit by load generated by aggressive AI scrapers that are otherwise indistinguishable from real users. As soon as a bot is polite enough to identify as some kind of bot, the problem's gone, as you can apply your regular measures for rate limiting and access control.

      (yes, there are also people who use it as an anti-AI statement, but that's not the reason why it's used on the most high-profile installations out there)

      • stingraycharles 5 hours ago

        Yeah that makes sense. Bad players will try to look like a regular browser, good players will have no problems revealing they’re a bot.

mandatory 5 hours ago
  • benatkin 3 hours ago

    > NOTE: Due to many WAFs employing JavaScript-level fingerprinting of web browsers, thermoptic also exposes hooks to utilize the browser for key steps of the scraping process. See this section for more information on this.

This reminds me of how Stripe does user tracking for fraud detection: https://mtlynch.io/stripe-update/ I wonder if thermoptic could handle that.

  • joshmn 3 hours ago

    Work like this is incredible. I did not know this existed. Thank you.

    • mandatory an hour ago

      Thanks :) if you have any issues with it let me know.

piggg an hour ago

Blocking on ja3/ja4 signals to folks exactly what you are up to. This is why bad actors doing ja3 randomization became a thing in the last few years and made ja3 matching useless.

Imo use ja3/ja4 as a signal and block on src IP. Don't show your cards. Ja4 extensions that compare network vs http/tls latency are also pretty elite for identifying folks who are proxying.

palmfacehn an hour ago

It is a cute technique, but I would prefer if the fingerprint were used higher up in the stack. The fingerprint should be compared against the User-Agent. I'm more interested in blocking curl when it is specifically reporting itself as Chrome/x.y.z.

Most of the abusive scraping is much lower hanging fruit. It is easy to identify the bots and relate that back to ASNs. You can then block all of Huawei cloud and the other usual suspects. Many networks aren't worth allowing at this point.

For the rest, the standard advice about performant sites applies.

keanb 3 days ago

Those bots would be really naive not to use curl-impersonate. I basically use it for any request I make even if I don’t expect to be blocked because why wouldn’t I.

  • VladVladikoff 3 hours ago

    A lot of the bots are compromised servers (eg hacked Wordpress sites), with limited control over what the TLS fingerprints look like.

  • f4uCL9dNSnQm 3 days ago

There are plenty of naive bots. That is why tar pits work so well at trapping them. And this TLS-based detection looks just like an offline/broken site to bots, so it will be harder to spot unless you are trying to scrape only that one single site.

  • _boffin_ 2 days ago

    I heard about curl-impersonate yesterday when I was hitting a CF page. Did something else to completely bypass it, which has been successful, but should try this.

unwind 3 days ago

I got exactly this far:

    uint8_t *data = (void *)(long)ctx->data;
before I stopped reading. I had to go look up struct xdp_md [1]; it is declared like this:

    struct xdp_md {
        __u32 data;
        __u32 data_end;
        __u32 data_meta;
        /* ... further fields elided ... */
    };
So clearly the `data` member is already an integer. The sane way to cast it would be to cast to the actual desired destination type, rather than first to some other random integer and then to a `void` pointer.

Like so:

    uint8_t * const data = (uint8_t *) ctx->data;
I added the `const` since the pointer value is not supposed to change once we get it from the incoming structure. Note that this `const` does not mean we can't write to the data if we feel like it; it means the base pointer itself can't change, i.e. we can't "re-point" the pointer. This is often a nice property, of course.

[1]: https://elixir.bootlin.com/linux/v6.17/source/include/uapi/l...

  • ziml77 2 days ago

Your code emits a compiler warning about casting an integer to a pointer. Changing the cast to void* emits a slightly different warning, about the integer being cast being smaller than the pointer type. Casting to a long first and then to void* avoids both of these warnings.

    • fn-mote 5 hours ago

      Sorry, all that stuff might be true but this whole process is nuts.

      The code segment containing that code looks like a no-op.

The rest of the post seems sane and well informed, so my theory is that this is a C / packet-filtering idiom I'm not aware of, since I work far from that field.

Otherwise I’m already freaked out by treating a 32 bit field as a pointer… even if you extend it first.

      • foxmoss 4 hours ago

> Otherwise I’m already freaked out by treating a 32 bit field as a pointer… even if you extend it first.

The cast from a 32 bit integer to a 64 bit pointer is in fact an eBPF oddity. What's happening here is that the virtual machine is just giving us a fake memory address to use in the program, and when the read actually needs to happen the kernel rewrites the virtual addresses to the real ones. I'm assuming this is just a byproduct of the memory separation that eBPF does to prevent filters from accidentally reading kernel memory.

        Also yes the double cast is just to keep the compiler from throwing a warning.

      • mbac32768 4 hours ago

        Yeah it's freaky. It's C code but it targets the eBPF virtual machine.

  • baobun an hour ago

    Possibly stupid question: Why does the author use different types for data and data_end in their struct?

OutOfHere 3 hours ago

I guess we'll just throw containerized headless browsers at those like you then. It'll only cost you more.