eqvinox 3 days ago

TFA — and most comments here — seem to completely miss what I thought was the main point of Anubis: it counters the crawler's "identity scattering"/sybil'ing/parallel crawling.

Any access will fall into either of the following categories:

- client with JS and cookies. In this case the server now has an identity to apply rate limiting to, from the cookie. Humans should never hit that limit, but crawlers will be slowed down immensely or ejected. Of course the identity can be rotated — at the cost of solving the puzzle again.

- amnesiac (no cookies) clients with JS. Each access is now expensive.

(- no JS - no access.)

The point is to prevent parallel crawling and overloading the server. Crawlers can still start an arbitrary number of parallel crawls, but each one costs to start and needs to stay below some rate limit. Previously, the server would collapse under thousands of crawler requests per second. That is what Anubis is making prohibitively expensive.
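To make that concrete, here's a minimal sketch of the kind of per-identity rate limiting the cookie enables (Express-style; the cookie name and limits are placeholders, not Anubis's actual ones):

  // Minimal sketch: rate limiting keyed on the PoW cookie.
  // "anubis_token" and the limits are made up, not Anubis's actual ones.
  const express = require("express");
  const cookieParser = require("cookie-parser");

  const app = express();
  app.use(cookieParser());

  const WINDOW_MS = 60_000; // 1-minute window
  const MAX_HITS = 30;      // per identity per window
  const hits = new Map();   // cookie value -> { count, windowStart }

  app.use((req, res, next) => {
    const id = req.cookies["anubis_token"];
    if (!id) return res.status(401).send("solve the challenge first");

    const now = Date.now();
    const entry = hits.get(id) ?? { count: 0, windowStart: now };
    if (now - entry.windowStart > WINDOW_MS) {
      entry.count = 0;
      entry.windowStart = now;
    }
    entry.count += 1;
    hits.set(id, entry);

    // Rotating to a fresh identity means solving the puzzle again.
    if (entry.count > MAX_HITS) return res.status(429).send("slow down");
    next();
  });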

  • qwery 3 days ago

    Yes, I think you're right. The commentary about its (presumed, imagined) effectiveness is very much making the assumption that it's designed to be an impenetrable wall[0] -- i.e. prevent bots from accessing the content entirely.

    I think TFA is generally quite good and has something of a good point about the economics of the situation, but finding the math shake out that way should, perhaps, lead one to question their starting point / assumptions[1].

    In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?

    [0] Mentioning 'impenetrable wall' is probably setting off alarm bells, because of course that would be a bad design.

    [1] (Edited to add:) I should say 'to question their assumptions more' -- like I said, the article is quite good and it does present this as confusing, at least.

    • 1gn15 2 days ago

      > In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?

      I agree, but the advertising is the whole issue. "Checking to see you're not a bot!" and all that.

      Therefore some people using Anubis expect it to be an impenetrable wall, to "block AI scrapers", especially those that believe it's a way for them to be excluded from training data.

      It's why just a few days ago there was an HN frontpage post of someone complaining that "AI scrapers have learnt to get past Anubis".

      But that is a fight that one will never win (analog hole as the nuclear option).

      If it said something like "Wait 5 seconds, our servers are busy!", I would think that people's expectations would be more accurate.

      As a robot I'm really not that sympathetic to anti-bot language backfiring on humans. I have to look away every time it comes up on my screen. If they changed their language and advertising, I'd be more sympathetic -- it's not as if I disagree that overloading servers for not much benefit is bad!

      • qwery a day ago

        Yeah, I think it's obviously a pretty natural conclusion to draw, that {thing for hinder crawler} ≅≅ {thing for stop all crawler}. Perhaps I should have stated that explicitly in the original comment.

        As for the presentation/advertising, I didn't get into it because I don't hold a particularly strong opinion. Well, I do hold a particularly strong opinion, but not one that really distinguishes Anubis from any of the other things. I'm fully onboard with what you're saying -- I find this sort of software extremely hostile and the fact that so many people don't[0] reminds me that I'm not a people.

        In my experience, this particular jump scare is about the same as any of the other services. The website is telling me that I'm not welcome for whatever arbitrary reason it is now, and everyone involved wants me to feel bad.

        Actually there is one thing I like about the Anubis experience[1] compared to the other ones, it doesn't "Would you like to play a game?" me. As a robot I appreciate the bluntness, I guess.

        (the games being: "click on this. now watch spinny. more. more. aw, you lose! try again?", and "wheel, traffic light, wildcard/indistinguishable"[2]).

        [0] "just ignore it, that's what I do" they say. "Oh, I don't have a problem like that. Sucks to be you."

        [1] yes, I'm talking upsides about the experience of getting **ed by it. I would ask how we got here but it's actually pretty easy to follow.

        [2] GCHQ et al. should provide a meatspace operator verification service where they just dump CCTV clips and you have to "click on the squares that contain: UNATTENDED BAG". Call it "phonebooth, handbag, foreign agent".

        (Apologies for all the weird tangents -- I'm just entertaining myself, I think I might be tired.)

  • thayne 3 days ago

    You don't necessarily need JS, you just need something that can detect if Anubis is used and complete the challenge.

    • eqvinox 3 days ago

      Sure, doesn't change anything though; you still need to spend energy on a bunch of hash calculations.

    • rocqua 3 days ago

      But then you rate limit that challenge.

      You could set up a system for parallelizing the creation of these Anubis PoW cookies independent of the crawling logic. That would probably work, but it's a pretty heavy lift compared to 'just run a browser with JavaScript'.

    • [removed] 3 days ago
      [deleted]
  • rocqua 3 days ago

    This is a good point, presuming the rate limiting is actually applied.

  • IshKebab 3 days ago

    Well maybe, but even then, how many parallel crawls are you going to do per site? 100 maybe? You can still get enough keys to do that for all sites in just a few hours per week.

wraptile 3 days ago

I'm a scraper developer and Anubis would have worked 10-20 years ago, but now all broad scrapers run on real headless browsers with full cookie support, which costs relatively little in compute. I'd be surprised if LLM bots used anything else, given that they already have all of this compute and engineering available.

That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times", because handling something custom is very difficult at broad scale. It took the author just a few minutes to solve this, but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.

You can actually see this in real life if you google web scraping services and which targets they claim to bypass - all of them bypass generic anti-bots like Cloudflare, Akamai etc. but struggle with custom and rare stuff like Chinese websites or small forums, because the scraping market is a market like any other and high-value problems are solved first. So becoming a low-value problem is a very easy way to avoid confrontation.

  • jandrese 3 days ago

    > That being said, one point is very correct here - by far the best effort to resist broad crawlers is a _custom_ anti-bot that could be as simple as "click your mouse 3 times", because handling something custom is very difficult at broad scale.

    Isn't this what Microsoft is trying to do with their sliding puzzle piece and choose the closest match type systems?

    Also, if you come in on a mobile browser it could ask you to lay your phone flat and then shake it up and down for a second or something similar that would be a challenge for a datacenter bot pretending to be a phone.

  • DanielHB 3 days ago

    How do you bypass Cloudflare? I do some light scraping for some personal stuff, but I can't figure out how to bypass it. Like do you randomize IPs using several VPNs at the same time?

    I usually just sit there on my phone pressing the "I am not a robot box" when it triggers.

    • wraptile 2 days ago

      It's still pretty hard to bypass it with open source solutions. To bypass CF you need:

      - an automated browser that doesn't leak the fact it's being automated

      - ability to fake the browser fingerprint (e.g. Linux is heavily penalized)

      - residential or mobile proxies (for small scale your home IP is probably good enough)

      - deployment environment that isn't leaked to the browser.

      - realistic scrape pattern and header configuration (header order, referer, prewalk some pages with cookies etc.)

      This is really hard to do at scale, but for small personal scripts you can get reasonable results with flavor-of-the-month Playwright forks on GitHub like nodriver, or dedicated tools like FlareSolverr. Honestly though, I'd just find a web scraping API with a low entry price, drop $15/month, and avoid this chase, because it can be really time consuming.

      If you're really on a budget - most of them offer 1,000 credits for free, which will get you on average ~100 pages a month per service, and you can get 10 of them as they all mostly function the same.
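      If you'd rather DIY it, the small-scale version looks roughly like this with stock Playwright (proxy URL and target site are placeholders, and stock Playwright still leaks automation signals -- hence the stealth forks):

        // Rough sketch with stock Playwright; proxy and target are placeholders.
        const { chromium } = require("playwright");

        (async () => {
          const browser = await chromium.launch({
            headless: true,
            // Residential/mobile proxy; datacenter IPs score badly.
            proxy: { server: "http://res-proxy.example.com:8000" },
          });
          const context = await browser.newContext({
            // Keep the fingerprint plausible and consistent.
            userAgent:
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
              "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
            locale: "en-US",
            timezoneId: "America/New_York",
            viewport: { width: 1366, height: 768 },
          });
          const page = await context.newPage();
          // Prewalk an ordinary page so cookies and referer look realistic.
          await page.goto("https://example.com/", { waitUntil: "networkidle" });
          await page.waitForTimeout(2000 + Math.random() * 3000); // human-ish pacing
          await page.goto("https://example.com/target-page");
          console.log(await page.title());
          await browser.close();
        })();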

    • hinach4n 3 days ago

      I believe usually you would bypass by using residential IPs / proxies?

      • DanielHB 3 days ago

        I run it through my home network and I'm still triggering it. I add 2s delays between page loads and it still triggers.

        • jijijijij 2 days ago

          Well, if that's true... I am so sorry to tell you this, it looks like you are in fact a robot.

    • 1gn15 2 days ago

      I use Camoufox for the browser and "playwright-captcha" for the CAPTCHA solving action. It's not fully reliable but it works.

  • miki123211 3 days ago

    This only works if you're a low-value site (which admittedly most sites are).

  • hahn-kev 3 days ago

    Bot blocking through obscurity

    • lbhdc 3 days ago

      That's really the only option available here, right? The goal is to keep sites low friction for end users while stopping bots. Requiring an account with some moderation would stop the majority of bots, but it would add a lot of friction for your human users.

      • brookst 3 days ago

        The other option is proof of work. Make clients use JS to do expensive calculations that aren’t a big deal for single clients, but get expensive at scale. Not ideal, but another tool to potentially use.
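        Something like this, conceptually (a toy sketch, not Anubis's actual scheme): the server hands out a challenge, the client burns CPU to find a nonce, and verification costs the server a single hash:

          // Toy proof-of-work: find a nonce so sha256(challenge + nonce)
          // starts with `difficulty` zero hex digits.
          const { createHash } = require("node:crypto");

          function solve(challenge, difficulty) {
            const prefix = "0".repeat(difficulty);
            for (let nonce = 0; ; nonce++) {
              const digest = createHash("sha256")
                .update(challenge + nonce)
                .digest("hex");
              if (digest.startsWith(prefix)) return { nonce, digest };
            }
          }

          // Each extra zero digit multiplies expected work by 16: one visitor
          // barely notices, but a million requests add up.
          console.log(solve("server-issued-challenge", 4));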

    • tovej 3 days ago

      I like it, make the bot developers play whack-a-mole.

      Of course, you're going to have to verify each custom puzzle, aren't you?

  • sam0x17 3 days ago

    > It took the author just a few minutes to solve this, but for someone like Perplexity it would take hours of engineering and maintenance to implement a solution for each custom implementation, which is likely just not worth it.

    These are trivial for an AI agent to solve though, even with very dumb watered down models.

  • andai 3 days ago

    You can also generate custom solutions at scale with LLMs. So each user could get a different CAPTCHA.

    • josh-sematic 3 days ago

      At that point you’re probably spending more money blocking the scrapers than you would spend just letting them through.

      • lbhdc 3 days ago

        That seems like it would make bot blocking saas (like cloudflare or tollbit) more attractive because it could amortize that effort/cost across many clients.

Arnavion 3 days ago

>This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.

>I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.

No need to mimic the actual challenge process. Just change your user agent to not have "Mozilla" in it; Anubis only serves you the challenge if your user agent contains that. For myself I just made a sideloaded browser extension to override the UA header for the handful of websites I visit that use Anubis, including those two kernel.org domains.

(Why do I do it? For most of them I don't enable JS or cookies, so the challenge wouldn't pass anyway. For the ones that I do enable JS or cookies for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)

  • johnecheck 3 days ago

    Sadly, touching the user-agent header more or less instantly makes you uniquely identifiable.

    Browser fingerprinting works best against people with unique headers. There's probably millions of people using an untouched safari on iPhone. Once you touch your user-agent header, you're likely the only person in the world with that fingerprint.

    • sillywabbit 3 days ago

      If someone's out to uniquely identify your activity on the internet, your User-Agent string is going to be the least of your problems.

      • _def 3 days ago

        Not sure what you mean, as exactly this is happening currently on 99% of the web. Brought to you by: ads

    • Arnavion 3 days ago

      UA fingerprinting isn't a problem for me. As I said I only modify the UA for the handful of sites that use Anubis that I visit. I trust those sites enough that them fingerprinting me is unlikely, and won't be a problem even if they did.

    • NoMoreNicksLeft 3 days ago

      I'll set mine to "null" if the rest of you will set yours...

      • gabeio 3 days ago

        The string “null” or actually null? I have recently seen a huge amount of bot traffic that has no UA at all, and I just outright block it. It’s almost entirely (Microsoft cloud) Azure script attacks.

    • codedokode 3 days ago

      If your headers are new every time then it is very difficult to figure out who is who.

      • spoaceman7777 3 days ago

        yes, but it puts you in the incredibly small bucket of "users that have weird headers that don't mesh well", and makes using the rest of the (many) other fingerprinting techniques all the more accurate.

      • kelseydh 3 days ago

        It is very easy unless the IP address is also switching up.

      • heavyset_go 3 days ago

        It's very easy to train a model to identify anomalies like that.

        • johnecheck 2 days ago

          While it's definitely possible to train a model for that, 'very easy' is nonsense.

          Unless you've got some superintelligence hidden somewhere, you'd choose a neural net. To train, you need a large supply of LABELED data. Seems like a challenge to build that dataset; after all, we have no scalable method for classifying as of yet.

    • andrewmcwatters 3 days ago

      Yes, but you can take the bet, and win more often than not, that your adversary is most likely not tracking visitor probabilities if you can detect that they aren't using a major fingerprinting provider.

    • [removed] 3 days ago
      [deleted]
    • jagged-chisel 3 days ago

      I wouldn’t think the intention is to s/Mozilla// but to select another well-known UA string.

      • Arnavion 3 days ago

        The string I use in my extension is "anubis is crap". I took it from a different FF extension that had been posted in a /g/ thread about Anubis, which is where I got the idea from in the first place. I don't use other people's extensions if I can help it (because of the obvious risk), but I figured I'd use the same string in my own extension so as to be combined with users of that extension for the sake of user-agent statistics.

      • soulofmischief 3 days ago

        The UA will be compared to other data points such as screen resolution, fonts, plugins, etc. which means that you are definitely more identifiable if you change just the UA vs changing your entire browser or operating system.

      • throwawayffffas 3 days ago

        I don't think there are any.

        Because servers would serve different content based on user agent, virtually all browsers' user agent strings start with Mozilla/5.0...

    • [removed] 3 days ago
      [deleted]
  • Animats 3 days ago

    > (Why do I do it? For most of them I don't enable JS so the challenge wouldn't pass anyway. For the ones that I do enable JS for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)

    Hm. If your site is "sticky", can it mine Monero or something in the background?

    We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"

    • mikestew 3 days ago

      > We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"

      Doesn't Safari sort of already do that? "This tab is using significant power", or summat? I know I've seen that message, I just don't have a good repro.

      • qualeed 3 days ago

        Edge does, as well. It drops a warning in the middle of the screen, displays the resource-hogging tab, and asks whether you want to force-close the tab or wait.

  • zahlman 3 days ago

    > Just change your user agent to not have "Mozilla" in it. Anubis only serves you the challenge if your user agent contains that.

    Won't that break many other things? My understanding was that basically everyone's user-agent string nowadays is packed with a full suite of standard lies.

    • Arnavion 3 days ago

      It doesn't break the two kernel.org domains that the article is about, nor any of the others I use. At least not in a way that I noticed.

    • throwawayffffas 3 days ago

      In 2025 I think most of the web has moved on from checking user strings. Your bank might still do it but they won't be running Anubis.

      • Aachen 3 days ago

        Nope, they're on cloudflare so that all my banking traffic can be intercepted by a foreign company I have no relation to. The web is really headed in a great direction :)

      • account42 3 days ago

        The web as a whole definitely has not moved on from that.

  • msephton 3 days ago

    I'm interested in your extension. I'm wondering if I could do something similar to force text encoding of pages into Japanese.

    • Arnavion 2 days ago

      If your Firefox supports sideloading extensions then making extensions that modify request or response headers is easy.

      All the API is documented in https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/Web... . My Anubis extension modifies request headers using `browser.webRequest.onBeforeSendHeaders.addListener()` . Your case sounds like modifying response headers which is `browser.webRequest.onHeadersReceived.addListener()` . Either way the API is all documented there, as is the `manifest.json` that you'll need to write to register this JS code as a background script and whatever permissions you need.

      Then zip the manifest and the script together, rename the zip file to "<id_in_manifest>.xpi", place it in the sideloaded extensions directory (depends on distro, eg /usr/lib/firefox/browser/extensions), restart firefox and it should show up. If you need to debug it, you can use the about:debugging#/runtime/this-firefox page to launch a devtools window connected to the background script.
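      The whole thing is only a couple dozen lines. Roughly, as a sketch (placeholder domain and extension ID, manifest v2 -- not my actual extension):

        // manifest.json (placeholder id and domain)
        {
          "manifest_version": 2,
          "name": "ua-override",
          "version": "1.0",
          "browser_specific_settings": { "gecko": { "id": "ua-override@example.com" } },
          "permissions": ["webRequest", "webRequestBlocking", "https://git.example.org/*"],
          "background": { "scripts": ["background.js"] }
        }

        // background.js: rewrite the UA header, only for the Anubis sites
        browser.webRequest.onBeforeSendHeaders.addListener(
          (details) => {
            for (const header of details.requestHeaders) {
              if (header.name.toLowerCase() === "user-agent") {
                header.value = "not-mozilla"; // anything without "Mozilla"
              }
            }
            return { requestHeaders: details.requestHeaders };
          },
          { urls: ["https://git.example.org/*"] },
          ["blocking", "requestHeaders"]
        );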

      • msephton 2 days ago

        Cheers! I'm in Safari so I'll see if there's a match.

  • semiquaver 3 days ago

    Doesn’t that just mean the AI bots can do the same? So what’s the point?

  • danieltanfh95 3 days ago

    wtf? how is this then better than a captcha or something similar?!

  • throw84a747b4 3 days ago

    [flagged]

    • gruez 3 days ago

      >Not only is Anubis a poorly thought out solution from an AI sympathizer [...]

      But the project description describes it as a project to stop AI crawlers?

      > Weighs the soul of incoming HTTP requests to stop AI crawlers

      • throw84a747b4 3 days ago

        Why would a company that wants to stop AI crawlers give talks on LLMs and diffusion models at AI conferences?

        Why would they use AI art for the first Anubis mascot until GitHub users called out the hypocrisy on the issue tracker?

        Why would they use Stable Diffusion art in their blogposts until Mastodon and Bluesky users called them out on it?

      • account42 3 days ago

        AI companies are just as interested in stopping competing crawlers as anyone else.

    • [removed] 3 days ago
      [deleted]
ksymph 4 days ago

This is neither here nor there but the character isn't a cat. It's in the name, Anubis, who is an Egyptian deity typically depicted as a jackal or generic canine, and the gatekeeper of the afterlife who weighs the souls of the dead (hence the tagline). So more of a dog-girl, or jackal-girl if you want to be technical.

  • esperent 3 days ago

    Every representation I've ever seen of Anubis - including remarkably well preserved statues from antiquity - is either a male human body with a canine head, or fully canine.

    This anime girl is not Anubis. It's a modern cartoon character that simply borrows the name because it sounds cool, without caring anything about the history or meaning behind it.

    Anime culture does this all the time, drawing on inspiration from all cultures but nearly always only paying the barest lip service to the original meaning.

    I don't have an issue with that, personally. All cultures and religions should be fair game as inspiration for any kind of art. But I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source just because they share a name and some other very superficial characteristics.

    • account42 3 days ago

      It's also that the anime style already makes all heads shaped vaguely like felines. Add upwards pointing furry ears and it's not wrong to call it a cat girl.

    • ksymph 3 days ago

      > they share a name and some other very superficial characteristics.

      I wasn't implying anything more than that, although now I see the confusing wording in my original comment. All I meant to say was that between the name and appearance it's clear the mascot is canid rather than feline. Not that the anime girl with dog ears is an accurate representation of the Egyptian deity haha.

    • SnuffBox 2 days ago

      It's refreshing to see a reply as thought out as this in today's day and age of "move fast and post garbage".

    • qwery 3 days ago

      I think you're taking it a bit too seriously. In turn, I am, of course, also taking it too seriously.

      > I do have an issue with claiming that the newly inspired creation is equivalent in any way to the original source

      Nobody is claiming that the drawing is Anubis or even a depiction of Anubis, like the statues etc. you are interested in. It's a mascot. "Mascot design by CELPHASE" -- it says, in the screenshot.

      Generally speaking -- I can't say that this is what happened with this project -- you would commission someone to draw or otherwise create a mascot character for something after the primary ideation phase of the something. This Anubis-inspired mascot is, presumably, Anubis-inspired because the project is called Anubis, which is a name with fairly obvious connections to and an understanding of "the original source".

      > Anime culture does this all the time, ...

      I don't know what bone you're picking here. This seems like a weird thing to say. I mean, what anime culture? It's a drawing on a website. Yes, I can see the manga/anime influence -- it's a very popular, mainstream artform around the world.

      • esperent 3 days ago

        I like to talk seriously about art, representation, and culture. What's wrong with that? It's at least as interesting as discussing databases or web frameworks.

        In case you feel it needs linking to the purpose of this forum, the art in question here is being forcefully shown to people in a situation that makes them do a massive context switch. I want to look at the linux or ffmpeg source code but my browser failed a security check and now I'm staring at a random anime girl instead. What's the meaning here, what's the purpose behind this? I feel that there's none, except for the library author's preference, and therefore this context switch wasted my time and energy.

        Maybe I'm being unfair and the code author is so wrapped up in liking anime girls that they think it would be soothing to people who end up on that page. In which case, massive failure of understanding the target audience.

        Maybe they could allow changing the art or turning it off?

        >> Anime culture does this all the time

        > I don't know what bone you're picking here

        I'm not picking any bone there. I love anime, and I love the way it feels so free in borrowing from other cultures. That said, the anime I tend to like is more Miyazaki or Satoshi Kon and less kawaii girls.

  • ChrisRR 3 days ago

    I'm assuming the aversion is more about why young anime girls are popping up, not about what animal it is

    • armada651 3 days ago

      Why is there an aversion though? Is it about the image itself or because of the subculture people are associating with the image?

      • ChrisRR 3 days ago

        Both. I don't want any random pictures of young girls popping up while I'm browsing the web, and why would adults insert pictures of young girls into their project in the first place?

      • octo888 3 days ago

        It's an aversion to the sexualised depiction of girls barely the age of puberty or under the age of consent.

        I'd ask why you /don't/ have an aversion to that?

        (yes, "not all anime" etc...)

  • pak9rabid 3 days ago

    Well, thank you for that. That's a great weight off me mind.

  • JdeBP 3 days ago

    ... but entirely lacking the primary visual feature that Anubis had.

rootsudo 3 days ago

When I read it, I instantly knew it was Anubis. I hope the anime catgirls never disappear from that project :)

  • hdndiebf 3 days ago

    This anime thing is the one thing about computer culture that I just don't seem to get. I did not get it as a child, when suddenly half of children's cartoons became anime and I just disliked the aesthetic. I didn't get it in school, when people started reading manga. I'll probably never get it. Therefore I sincerely hope they do go away from Anubis, so I can further dwell in my ignorance.

    • timcambrant 3 days ago

      I feel the same. It's a distinct part of nerd culture.

      In the '70s, if you were into computers you were most likely also a fan of Star Trek. I remember an anecdote from the 1990s when an entire dial-up ISP was troubleshooting its modem pools because there were zero people connected and they assumed there was an outage. The outage happened to occur exactly while that week's episode of X-Files was airing in their time zone. Just as the credits rolled, all modems suddenly lit up as people connected to IRC and Usenet to chat about the episode. In ~1994 close to 100% of residential internet users also happened to follow X-Files on linear television. There was essentially a 1:1 overlap between computer nerds and sci-fi nerds.

      Today's analog seems to be that almost all nerds love anime and Andy Weir books and some of us feel a bit alienated by that.

      • SnuffBox 2 days ago

        > Today's analog seems to be that almost all nerds love anime and Andy Weir books and some of us feel a bit alienated by that.

        Especially because (from my observation) modern "nerds" who enjoy anime seem to relish bringing it (and various sex-related things) up at inappropriate times and are generally emotionally immature.

        It's quite refreshing seeing that other people have similar lines of thinking and that I'm not alone in feeling somewhat alienated.

      • cdrini 3 days ago

        I think I'd push back and say that nerd culture is no longer really a single thing. Back in the star trek days, the nerd "community" was small enough that star trek could be a defining quality shared by the majority. Now the nerd community has grown, and there are too many people to have defining parts of the culture that are loved by the majority.

        Eg if the nerd community had $x$ people in the star trek days, now there are more than $x$ nerds who like anime and more than $x$ nerds who dislike it. And the total size is much bigger than both.

    • armada651 3 days ago

      But what if they choose a different image that you don't get? What if they used an abstract modern art piece that no one gets? Oh the horror!

    • Aachen 3 days ago

      You don't have to get it to be able to accept that others like it. Why not let them have their fun?

      This sounds more as though you actively dislike anime than merely not seeing the appeal or being "ignorant". If you were to ignore it, there wouldn't be an issue...

      • account42 3 days ago

        They can have their fun on their personal websites. Subjecting others to your "fun" when you know it annoys them is not cool.

    • balamatom 3 days ago

      Might've caught on because the animes had plots, instead of considering viewers to have the attention spans of idiots like Western kids' shows (and, in the 21st century, software) tend to do.

      • timcambrant 3 days ago

        I don't think it's relevant to debate whether anime or other forms of media are objectively better. But as someone who has never understood anime, I view mainstream western TV series as filled with hours of cleverly written dialogue and long story arcs, whereas the little anime I've watched seems to mostly be overly dramatic colorful action scenes with intense screamed dialogue and strange bodily noises. Should we maybe assume that we are both a bit ignorant of the preferences of others?

        • balamatom 3 days ago

          Let's rather assume that you're the kind of person who debates a thing by first saying that it's not relevant to debate, then putting forward a pretty out-of-context comparison, and finally concluding that I should feel bad about myself. That kind of story arc does seem to correlate with finding mainstream Western TV worthwhile; there's something structurally similar to the funny way your thought went.

  • bawolff 3 days ago

    Its nice to see there is still some whimsy on the internet.

    Everything got so corporate and sterile.

    • account42 3 days ago

      Everyone copying the same Japanese cartoon style isn't any better than everyone copying corporate memphis.

      • [removed] 3 days ago
        [deleted]
      • lordhumphrey 3 days ago

        I think it definitely would be better. Perhaps only a small improvement, but still.

  • ghssds 3 days ago

    As Anubis the Egyptian god is represented as a dog-headed human, I thought the drawing was of a dog-girl.

    • nemomarx 3 days ago

      Perhaps a jackal girl? I guess "cat girl" gets used very broadly to mean kemomimi (pardon the spelling) though

  • Der_Einzige 3 days ago

    It's not the only project with an anime girl as its mascot.

    ComfyUI has what I think is a foxgirl as its official mascot, and that's the de-facto primary UI for generating Stable Diffusion or related content.

    • SnuffBox 2 days ago

      I've noticed the word "comfy" used more than usual recently and often by the anime-obsessed, is there cultural relevance I'm not understanding?

      • AlexeyBelov 16 hours ago

        OK, you've been all over this thread being negative and angry. On a new account, which makes it even more sus. Take a break from social media.

  • bakugo 3 days ago

    It's more likely that the project itself will disappear into irrelevance as soon as AI scrapers bother implementing the PoW (which is trivial for them, as the post explains) or figure out that they can simply remove "Mozilla" from their user-agent to bypass it entirely.

    • debugnik 3 days ago

      > as AI scrapers bother implementing the PoW

      That's what it's for, isn't it? Make crawling slower and more expensive. Shitty crawlers not being able to run the PoW efficiently or at all is just a plus. Although:

      > which is trivial for them, as the post explains

      Sadly the site's being hugged to death right now so I can't really tell if I'm missing part of your argument here.

      > figure out that they can simply remove "Mozilla" from their user-agent

      And flag themselves in the logs to get separately blocked or rate limited. Servers win if malicious bots identify themselves again, and forcing them to change the user agent does that.

      • throwawayffffas 3 days ago

        > That's what it's for, isn't it? Make crawling slower and more expensive.

        The default settings produce a computational cost of milliseconds for a week of access. For this to be relevant it would have to be significantly more expensive to the point it would interfere with human access.

      • shkkmo 3 days ago

        The explanation of how the estimate is made is more detailed, but here is the referenced conclusion:

        >> So (11508 websites * 2^16 sha256 operations) / 2^21, that’s about 6 minutes to mine enough tokens for every single Anubis deployment in the world. That means the cost of unrestricted crawler access to the internet for a week is approximately $0.

        >> In fact, I don’t think we reach a single cent per month in compute costs until several million sites have deployed Anubis.

    • skydhash 3 days ago

      It's more about the (intentional?) DDoS from AI scrapers than preventing them from accessing the content. Bandwidth is not cheap.

    • unclad5968 3 days ago

      I'm not on Firefox or any Firefox derivative and I still get anime cat girls making sure I'm not a bot.

      • nemomarx 3 days ago

        Mozilla is used in the user agent string of all major browsers for historical reasons, but not necessarily headless ones or so on.

    • [removed] 3 days ago
      [deleted]
    • dingnuts 3 days ago

      [flagged]

      • verteu 3 days ago

        > PoW increases the cost for the bots which is great. Trivial to implement, sure, but that added cost will add up quickly.

        No, the article estimates it would cost less than a single penny to scrape all pages of 1,000,000 distinct Anubis-guarded websites for an entire month.

      • userbinator 3 days ago

        I thought HN was anti-copyright and anti-imaginary-property, or at least the bulk of its users were. Yet all of a sudden, "but AI!!!!1"?

        > a federal crime

        The rest of the world doesn't care.

        • klabb3 3 days ago

          > I thought HN was anti-copyright

          Maybe. But what’s happening is ”copyright for thee not for me”, not a universal relaxation of copyright. This loophole exploitation by behemoths doesn’t advance any ideological goals, it only inflames the situation because now you have an adversarial topology. You can see this clearly in practice – more and more resources are going into defense and protection of data than ever before. Fingerprinting, captchas, paywalls, login walls, etc etc.

      • altairprime 3 days ago

        Don’t forget signed attestations from “user probably has skin in the game” cloud providers like iCloud (already live in Safari and accepted by Cloudflare, iirc?) — not because they identify you but because abusive behavior will trigger attestation provider rate limiting and termination of services (which, in Apple’s case, includes potentially a console kill for the associated hardware). It’s not very popular to discuss at HN but I bet Anubis could add support for it regardless :)

        https://datatracker.ietf.org/wg/privacypass/about/

        https://www.w3.org/TR/vc-overview/

      • shkkmo 3 days ago

        > PoW increases the cost for the bots which is great.

        But not by any meaningful amount, as explained in the article. All it actually does is rely on its obscurity while interfering with legitimate use.

      • nialv7 3 days ago

        > Fuck AI scrapers, and fuck all this copyright infringement at scale.

        Yes, fuck them. Problem is, Anubis here is not doing the job. As the article already explains, currently Anubis is not adding a single cent to the AI scrapers' costs. For Anubis to become effective against scrapers, it will necessarily have to become quite annoying for legitimate users.

  • guappa 3 days ago

    We all know it's doomed

    • balamatom 3 days ago

      That's called a self-fulfilling prophecy and is not in fact mandatory to participate in.

      • guappa 3 days ago

        I'm not making any git commits to remove it…

        • balamatom 3 days ago

          Probably talking about different doomed things then, sorry.

bawolff 3 days ago

> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.

Counterpoint - it seems to work. People use Anubis because it's the best of bad options.

If theory and reality disagree, it means either you are missing something or your theory is wrong.

  • semiquaver 3 days ago

    Counter-counter point: it only stopped them for a few weeks and now it doesn’t work: https://news.ycombinator.com/item?id=44914773

    • jeroenhd 3 days ago

      Geoblocking China and Singapore solves that problem, it seems, at least the non-residential IPs (though I also see a lot of aggressive bots coming from residential IP space from China).

      I wish the old trick of sending CCP-unfriendly content to get the great firewall to kill the connection for you still worked, but in the days of TLS everywhere that doesn't seem to work anymore.

    • Aachen 3 days ago

      Only Huawei so far, no? That could be easy to block on a network level for the time being

      Of course we knew from the beginning that this first stage of "bots don't even try to solve it, no matter the difficulty" isn't a forever solution

      • jeroenhd 3 days ago

        AliCloud also seems to send a more capable scraper army, but so far they're not using botnets ("residential proxies") to hide their bad practices.

sidewndr46 3 days ago

> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans

I'm unsure if this is deadpan humor or if the author has never tried to solve a CAPTCHA that is something like "select the squares with an orthodox rabbi present"

  • classichasclass 3 days ago

    The problem with that CAPTCHA is you're not allowed to solve it on Saturdays.

  • windward 3 days ago

    I wonder if it's an intentional quirk that you can only pass some CAPTCHAs if you're a human who knows what an American fire hydrant or school bus looks like?

    • lproven 2 days ago

      > an American fire hydrant or school bus

      So much this. The first time one asked me to click on "crosswalks", I genuinely had to think for a while as I struggled to remember WTF a "crosswalk" was in AmEng. I am a native English speaker, writer, editor and professionally qualified teacher, but my form of English does not have the word "crosswalk" or any word that is a synonym for it. (It has phrases instead.)

      Our schoolbuses are ordinary buses with a special number on the front. They are no specific colour.

      There are other examples which aren't coming immediately to mind, but it is vexing when the designer of a CAPTCHA isn't testing if I am human but if I am American.

    • latexr 3 days ago

      I doubt it’s intentional. Google (owner of reCAPTCHA) is a US company, so it’s more likely they either haven’t considered what they see every day is far from universal; don’t care about other countries; or specifically just care about training for the US.

    • jeroenhd 3 days ago

      Google demanding I flag yellow cars when asked to flag taxis is the silliest Americanism I've seen. At least the school bus has SCHOOL BUS written all over it and fire hydrants aren't exactly an American exclusive thing.

      On some Russian and Asian site I ran into trouble signing up for a forum using translation software because the CAPTCHA requires me to enter characters I couldn't read or reproduce. It doesn't happen as often as the Google thing, but the problem certainly isn't restricted to American sites!

  • wingworks 3 days ago

    There are also services out there that will solve any CAPTCHA for you at a very small cost to you. And an AI company will get steep discounts with the volumes of traffic they do.

    There are some browser extensions for it too, like NopeCHA; it works 99% of the time and saves me the hassle of doing them.

    Any site using CAPTCHAs today is really only hurting their real customers and low-hanging fruit.

    Of course this assumes they can't solve the CAPTCHA themselves with AI, which often they can.

    • petesergeant 3 days ago

      Yes, but not at a rate that enables them to be a risk to your hosting bill. My understanding is that the goal here isn't to prevent crawlers, it's to prevent overly aggressive ones.

  • bawolff 3 days ago

    Well the problem is that computers got good at basically everything.

    Early 2000s captchas really were like that.

    • ok123456 3 days ago

      The original reCAPTCHA was doing distributed book OCR. It was sold as an altruistic project to help transcribe old books.

      • guappa 3 days ago

        And now they're using us to train car driving AI :(

pkal 3 days ago

Superficial comment regarding the catgirl: I don't get why some people are so adamant and enthusiastic for others to see it, but if you, like me, find it distasteful and annoying, consider copying these uBlock rules: https://sdf.org/~pkal/src+etc/anubis-ublock.txt. Brings me joy to know what I am not seeing whenever I get stopped by this page :)

  • squigz 3 days ago

    I don't get why so many people find it "distasteful and annoying"

    • pkal 2 days ago

      Can you clarify if you mean that you do not understand the reasons that people dislike these images, or do you find the very idea of disliking it hard to relate to?

      I cannot claim that I understand it well, but my best guess is that these are images that represent a kind of culture that I have encountered both in real-life and online that I never felt comfortable around. It doesn't seem unreasonable that this uneasiness around people with identity-constituting interests in anime, Furries, MLP, medieval LARP, etc. transfers back onto their imagery. And to be clear, it is not like I inherently hate anime as a medium or the idea of anthropomorphism in art. There is some kind of social ineptitude around propagating these _kinds_ of interests that bugs me.

      I cannot claim that I am satisfied with this explanation. I know that the dislike I feel for this is very similar to what I feel when visiting a hacker space where I don't know anyone. But I hope that I could at least give a feeling for why some people don't like seeing catgirls every time I open a repository, and that it doesn't necessarily have anything to do with advocating for a "corporate soulless web".

    • account42 3 days ago

      You could respect it without "getting" it though.

    • IshKebab 3 days ago

      I can't really explain it but it definitely feels extremely cringeworthy. Maybe it's the neckbeard sexuality or the weird furry aspect. I don't like it.

sugarpimpdorsey 3 days ago

Every time I see one of these I think it's a malicious redirect to some pervert-dwelling imageboard.

On that note, is kernel.org really using this for free and not the paid version without the anime? Is the Linux Foundation really that desperate for cash after they gas up all the BMWs?

  • qualeed 3 days ago

    It's crazy (especially considering anime is more popular now than ever; Netflix alone is making billions a year on anime) that people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard".

    • magicalhippo 3 days ago

      > people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard"

      Think you can thank the furries for that.

      Every furry I've happened to come across was very pervy in some way, so that's what immediately comes to mind when I see furry-like pictures like the one shown in the article.

      YMMV

      • voidUpdate 3 days ago

        Out of interest, how many furries have you met? I've been to several fur meets, and have met approximately three furries who I would not want to know anymore for one reason or another

    • Seattle3503 3 days ago

      To be fair, that's the sort of place where I spend most of my free time.

    • gruez 3 days ago

      "Anime pfp" stereotype is alive and well.

    • ants_everywhere 3 days ago

      they've seized the moment to move the anime cat girls off the Arch Linux desktop wallpapers and onto lore.kernel.org.

    • account42 3 days ago

      It's not crazy at all that anyone who has been online for more than a day has that association.

    • turtletontine 3 days ago

      Even if the images aren’t the kind of sexualized (or downright pornographic) content this implies… having cutesy anime girls pop up when a user loads your site is, at best, wildly unprofessional. (Dare I say “cringe”?) For something as serious and legit as kernel.org to have this, I do think it’s frankly shocking and unacceptable.

    • mvdtnz 3 days ago

      [flagged]

      • qualeed 3 days ago

        >If you don't get pedophile vibes from that picture it's on you.

        Wow, what an absolutely wild statement. I hate to break it to you, but I'm not the one sexualizing the cartoon picture.

  • Dilettante_ 3 days ago

    For me it's the flipside: It makes me think "Ahh, my people!"

  • creatonez 3 days ago

    Huh, why would they need the unbranded version? The branded version works just fine. It's usually easier to deploy ordinary open source software than software that needs to be licensed, because you don't need special download pages or license keys.

    If it makes sense for an organization to donate to a project they rely on, then they should just donate. No need to debrand if it's not strictly required, all that would do is give the upstream project less exposure. For design reasons maybe? But LKML isn't "designed" at all, it has always exposed the raw ugly interface of mailing list software.

    Also, this brand does have trust. Sure, I'm annoyed by these PoW captcha pages, but I'm a lot more likely to enable Javascript if it's the Anubis character, than if it is debranded. If it is debranded, it could be any of the privacy-invasive captcha vendors, but if it's Anubis, I know exactly what code is going to run.

    • rustystump 3 days ago

      If I saw an anime pic show up, that'd be a flag. I only know of Anubis' existence and use of anime from HN.

      It is only trusted by a small subset of people who are in the know. It is not about "anime bad" but that a large chunk of the population isn't into it for whatever reason.

      I love anime but it can also be cringe. I find this cringe as it seems many others do too.

  • Lammy 3 days ago

    [flagged]

    • sugarpimpdorsey 3 days ago

      > Anubis is a clone of Kiwiflare, not an original work, so you're actually sort of half-right:

      Interesting. That itself appears to be a clone of haproxy-protection. I know there has also been an nginx module that does the same for some time. Either way, proof-of-work is by this point not novel.

      Everyone seems to have overlooked the more substantive point of my comment which is that it appears kernel.org cheaped out and is using the free version of Anubis, instead of paying up to support the developer for his work. You know they have the money to do it.

      In 2024 the Linux Foundation reported $299.7M in expenses, with $22.7M of that going toward project infrastructure and $15.2M on "event services" (I guess making sure the cotton candy machines and sno-cone makers were working at conferences).

      My point is, cough up a few bucks for a license you chiselers.

      • prmoustache 3 days ago

        > Everyone seems to have overlooked the more substantive point of my comment which is that it appears kernel.org cheaped out and is using the free version of Anubis, instead of paying up to support the developer for his work. You know they have the money to do it.

        > In 2024 the Linux Foundation reported $299.7M in expenses, with $22.7M of that going toward project infrastructure and $15.2M on "event services" (I guess making sure the cotton candy machines and sno-cone makers were working at conferences).

        > My point is, cough up a few bucks for a license you chiselers.

        Several points:

        - there is no license to pay. This is free (as in open source and as in beer) software. There is commercial support if you feel you need it and sponsoring options however. Sponsoring is not paying a license.

        - Sometimes it takes so long to get approval for a sponsor that large org member give up.

        - Obviously kernel.org is using an old release of Anubis, so they likely observed a huge spike in bandwidth used at some point and deployed Anubis, solving the problem immediately. I don't remember Anubis proposing a paid license at the time of the early releases. I may be wrong, but it may be that kernel.org admins have never heard of the possibility of sponsoring nor are they interested in support.

        - you don't have to pay anything to change/remove the image, and the people who implemented this clearly do not care, as they didn't do it.

        - do we have evidence that the anubis developer ever donated directly or indirectly to Linus Torvalds and the thousands of developers who worked on the kernel?

    • creatonez 3 days ago

      Anubis has nothing to do with Kiwiflare, there's no connection at all. It's not the same codebase, and the inspiration for Anubis comes from Hashcash (1997) and numerous other examples of web PoW that predate Kiwiflare, which perhaps tens of thousands of websites were already using as an established technique. What makes you think it is a clone of it?

    • efilife 3 days ago

      Can somebody please explain why was this comment flagged to death? I seem to be missing something

      • ufo 3 days ago

        Possibly because it links to kiwifarms (nasty website to say the least)

      • creatonez 3 days ago

        Well, it's both complete misinformation and attempts to tie a reputable open source project to an unrelated harassment and stalking website.

    • fortran77 3 days ago

      I saw the description and thought "Wow! That works just like the DDoS retarding of KiwiFlare." I didn't know it was a proper fork of it.

bogwog 3 days ago

I wonder if the best solution is still just to create link mazes with garbage text like this: https://blog.cloudflare.com/ai-labyrinth/

It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big name company, and force them to reassess their crawling strategy going forward?

  • ronsor 3 days ago

    That won't work, because garbage data is filtered after the full dataset is collected anyway. Every LLM trainer these days knows that curation is key.

    • bogwog 2 days ago

      If the "garbage data" is AI generated, it'll be hard or impossible to filter.

  • creatonez 3 days ago

    Crawlers already know how to stop crawling recursive or otherwise excessive/suspicious content. They've dealt with this problem long before LLM-related crawling.

ok123456 3 days ago

Why is kernel.org doing this for essentially static content? Cache-Control headers and ETags should solve this. Also, the Linux kernel has solved the C10K problem.
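For illustration, the kind of exchange I mean (example values): the server tags each response with a validator, and a returning client or cache revalidates for the cost of a header instead of a full body:

  HTTP/1.1 200 OK
  Cache-Control: public, max-age=3600, stale-while-revalidate=86400
  ETag: "a1b2c3"

  GET /lore/thread.html HTTP/1.1        (later, same client or a cache)
  If-None-Match: "a1b2c3"

  HTTP/1.1 304 Not Modified             (no body re-sent if unchanged)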

  • mixologic 3 days ago

    Because it's static content that is almost never cached, because it's infrequently accessed. Thus, almost every hit goes to the origin.

    • ok123456 2 days ago

      The contents in question are statically generated, 1-3 KB HTML files. Hosting a single image would be the equivalent of cold serving 100s of requests.

      Putting up a scraper shield seems like it's more of a political statement than a solution to a real technical problem. It's also antithetical to open collaboration and an open internet of which Linux is a product.

  • whatevaa 3 days ago

    Bots don't respect that.

    • 1gn15 3 days ago

      Use a CDN.

      • trenchpilgrim 3 days ago

        A great option for most people, and indeed Anubis' README recommends using Cloudflare if possible. However, not everyone can use a paid CDN. Some people can't pay because their payment methods aren't accepted. Some people need to serve content to countries that a major CDN can't for legal and compliance reasons. Some organizations need their own independent infrastructure to serve their organizational mission.

      • Aachen 2 days ago

        So that someone else pays for your bandwidth while seeing who is interested in this content? Idk about that solution

ChocolateGod 3 days ago

I have an S24 (a 2024 flagship) and Anubis often takes 10-20 seconds to complete; that time is going to add up if more and more sites adopt it, leading to a worse browsing experience and wasted battery life.

Meanwhile AI farms will just run their own nuclear reactors eventually and be unaffected.

I really don't understand why someone thought this was a good idea, even if well intentioned.

  • prmoustache 3 days ago

    Something must be wrong on your flagship smartphone because I have an entry level one that doesn't take that long.

    It seems there are a large number of operations crawling the web to build models that aren't directly using infrastructure hosted on AI farms BUT botnets running on commodity hardware and residential networks to keep their IP ranges from being blacklisted. Anubis's point is to block those.

  • Aachen 2 days ago

    Which browser and which difficulty setting is that?

    Because I've got the same model line but about 3 or 4 years older, and it usually just flashes by in the browser Lightning from F-Droid, which is an OS webview wrapper. On occasion it takes a second or maybe two; I assume that's either bad luck in finding a solution or a site with a higher difficulty setting. Not sure if I've seen it in Fennec (Firefox mobile) yet but, if so, it's the same there.

    I've been surprised that this low threshold stops bots but I'm reading in this thread that it's rather that bot operators mostly just haven't bothered implementing the necessary features yet. It's going to get worse... We've not even won the battle let alone the war. Idk if this is going to be sustainable, we'll see where the web ends up...

  • jeroenhd 3 days ago

    Either your phone is on some extreme power saving mode, your ad blocker is breaking Javascript, or something is wrong with your phone.

    I've certainly seen Anubis take a few seconds (three or four maybe) but that was on a very old phone that barely loaded any website more complex than HN.

  • vova_hn 3 days ago

    I have a Pixel 7 (released in 2022) and it usually takes less than a second...

  • TZubiri 3 days ago

    I remember that Litecoin briefly had this idea: to be easy on consumer hardware but hard on GPUs. The ASICs didn't take long to obliterate the idea though.

    Maybe there's going to be some form of pay per browse system? even if it's some negligible cost on the order of 1$ per month (and packaged with other costs), I think economies of scale would allow servers to perform a lifetime of S24 captchas in a couple of seconds.

  • whatevaa 3 days ago

    Something is wrong with your flagship if it takes that long.

    • ChocolateGod 3 days ago

      Samsung's UI has this feature where it turns on power saving mode when it detects light use.

    • prmoustache 3 days ago

      I guess his flagship IS compromised and part of an AI crawling botnet ;-)

    • Lammy 2 days ago

      You're looking at it wrong.

WesolyKubeczek 4 days ago

I disagree with the post author in their premise that things like Anubis are easy to bypass if you craft your bot well enough and throw the compute at it.

Thing is, the actual lived experience of webmasters tells us that the bots that scrape the internets for LLMs are nothing like crafted software. They are more like your neighborhood shit-for-brains meth junkies competing with one another over who makes more robberies in a day, no matter the profit.

Those bots are extremely stupid. They are worse than script kiddies’ exploit searching software. They keep banging the pages without regard to how often, if ever, they change. If they were 1/10th like many scraping companies’ software, they wouldn’t be a problem in the first place.

Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?

  • int_19h 3 days ago

    It is the way it is because there are easy pickings to be made even with this low effort, but the more sites adopt such measures, the less stupid your average bot will be.

  • busterarm 3 days ago

    Those are just the ones that you've managed to ID as bots.

    Ask me how I know.