leumon 4 days ago

Seems like ai bots are indeed bypassing the challenge by computing it: https://social.anoxinon.de/@Codeberg/115033790447125787

  • debugnik 4 days ago

    That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all, which is more of a plus.

    This however forces servers to increase the challenge difficulty, which increases the waiting time for the first-time access.

    • NoGravitas 3 days ago

      The point is that it will always be cheaper for bot farms to pass the challenge than for regular users.

      • debugnik 3 days ago

        Why does that matter? The challenge needs to stay expensive enough to slow down bots, but legitimate users won't be solving anywhere near the same number of challenges, and the alternative is the site getting crawled to death, so they can wait once in a while.

      • bawolff 3 days ago

        It might be a lot closer if they were using argon2 instead of SHA. SHA is kind of a bad choice for this sort of thing.
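
        For reference, a minimal sketch of that idea, assuming the argon2-cffi package (this is not Anubis's actual scheme): the outer loop is the same as a SHA-256 PoW, but each attempt costs a memory-hard hash, which narrows the gap between a phone and a GPU farm.

          # Same PoW loop shape, but each nonce costs a memory-hard Argon2id
          # hash instead of a SHA-256 (sketch only, assumes argon2-cffi).
          from argon2.low_level import hash_secret_raw, Type

          def solve(challenge: bytes, difficulty_bits: int) -> int:
              target = 1 << (256 - difficulty_bits)
              nonce = 0
              while True:
                  digest = hash_secret_raw(
                      secret=challenge + nonce.to_bytes(8, "big"),
                      salt=b"fixed-16-byte-sa",            # argon2 requires a salt
                      time_cost=1, memory_cost=64 * 1024,  # 64 MiB per attempt
                      parallelism=1, hash_len=32, type=Type.ID,
                  )
                  if int.from_bytes(digest, "big") < target:
                      return nonce                         # GPUs/ASICs gain far less here
                  nonce += 1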

    • hiccuphippo 4 days ago

      Too bad the challenge's result is only a waste of electricity. Maybe they should do like some of those alt-coins and search for prime numbers or something similar instead.

      • bawolff 3 days ago

        Most of those alt-coins are kind of fake/scams. It's really hard to make it work with actually useful problems.

      • kevincox 4 days ago

        Of course that doesn't directly help the site operator. Maybe it could actually do a bit of bitcoin mining for the site owner. Then that could pay for the cost of accessing the site.

    • danieltanfh95 3 days ago

      this only holds true if the data to be accessed is less valuable than the computational cost. in this case, that is false and spending a few dollars to scrape data is more than worth it.

      reducing the problem to a cost issue is bound to be short sighted.

      • r0uv3n 3 days ago

        This is not about preventing crawling entirely, it's about finding a way to prevent crawlers from re-crawling everything way too frequently just because crawling is so cheap. Of course it will always be worth it to crawl the Linux Kernel mailing list, but maybe with a high enough cost per crawl the crawlers will learn to be fine with only crawling it once per hour, for example.

        • danieltanfh95 a day ago

          my comment is not about preventing crawling, it's stating that with how much revenue AI is bringing (real or not), the value of crawling repeatedly >>> the cost of running these flimsy coin-mining algorithms.

          At the very least captcha tries to make the human-AI distinction, but these algorithms are just purely on the side of making it "expensive". If it's just a capital problem, then it's not a problem for the big corps who are the ones incentivized to do so in the first place!

          even if human captcha solvers are involved, at the very least it provides society with some jobs (useless as they may be), but these mining algorithms do society no good, and waste compute for nothing!

  • [removed] 4 days ago
    [deleted]
johnea 4 days ago

My biggest bitch is that it requires JS and cookies...

Although the long term problem is the business model of servers paying for all network bandwidth.

Actual human users have consumed a minority of total net bandwidth for decades:

https://www.atom.com/blog/internet-statistics/

Part 4 shows bots out-using humans in 1996 8-/

What are "bots"? This needs to include goggleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.

The difference between that and the LLM training data scraping is that the previous non-human traffic was assumed, by site servers, to increase their human traffic through search engine ranking, and thus their revenue. However, the current training-data scraping is likely to have the opposite effect: capturing traffic with LLM summaries instead of redirecting it to the original source sites.

This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.

So far, it's in the same category as the environmental disaster in progress: ownership is refusing to acknowledge the problem and insisting on business as usual.

Rational predictions are that it's not going to end well...

  • jerf 4 days ago

    "Although the long term problem is the business model of servers paying for all network bandwidth."

    Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.

    • Imustaskforhelp 4 days ago

      The AI scrapers are most likely VC funded and all they care about is getting as much data as possible without worrying about the costs.

      They are renting machines at scale too, so bandwidth etc. is definitely cheaper for them. Maybe they use a provider that doesn't have too many bandwidth issues (hetzner?)

      But still, the point being that you might be hosting a website on your small server, and that scraper with its beast of a machine can come and effectively DDoS your server looking for data to scrape. Deterring them is what matters, so that the economics finally slide back in our favour again.

    • johnea 3 days ago

      Maybe my statement wasn't clear. The point is that the server operators pay for all of the bandwidth of access to their servers.

      When this access is beneficial to them, that's OK; when it's detrimental to them, they're paying for their own decline.

      The statement isn't really concerned with what if anything the scraper operators are paying, and I don't think that really matters in reaching the conclusion.

  • Hizonner 4 days ago

    > The difference between that and the LLM training data scraping

    Is the traffic that people are complaining about really training traffic?

    My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.

    That doesn't seem like enough traffic to be a really big problem.

    On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.

    That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.

    Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.

    So what's really going on here? Anybody actually know?

    • zerocrates 4 days ago

      The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.

      There's some user-directed traffic, but it's a small fraction, in my experience.

    • ncruces 3 days ago

      It's not random internet people saying it's training. It's Cloudflare, among others.

      Search for “A graph of daily requests over time, comparing different categories of AI Crawlers” on this blog: https://blog.cloudflare.com/ai-labyrinth/

    • Dylan16807 4 days ago

      The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits it's a steady load of lots of traffic going all over, from lots of IPs. And they really like diffs between old page revisions for some reason.

      • Hizonner 4 days ago

        That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.

        But if there's a (discoverable) page comparing every revision of a page to every other revision, and a page has N revisions, there are going to be (N^2-N)/2 delta pages, so could it just be the majority of the distinct pages your Wiki has are deltas?

        I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?

        The questions just multiply.

        • Dylan16807 4 days ago

          It's near-stock mediawiki, so it has a ton of old versions and diffs off the history tab but I'd expect a crawler to be able to handle it.

jimmaswell 4 days ago

What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?

  • themafia 4 days ago

    If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.

    Search engines, at least, are designed to index the content, for the purpose of helping humans find it.

    Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.

    This is exactly the reason that "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."

    • jimmaswell 3 days ago

      > copyright attribution

      You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example, if you subtract the vector for "man" from "woman" and add the difference to "king", you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al., and to say they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.

      • heavyset_go 3 days ago

        LLMs quite literally work at the level of their source material, that's how training works, that's how RAG works, etc.

        There is no proof that LLMs work at the level of "ideas"; if you could prove that, you'd solve a whole lot of incredibly expensive problems that are current bottlenecks for training and inference.

        It is a bit ironic that you'd call someone wanting to control and be paid for the thing they themselves created "selfish", while at the same time writing apologia on why it's okay for a trillion dollar private company to steal someone else's work for their own profit.

        It isn't some moral imperative that OpenAI gets access to all of humanity's creations so they can turn a profit.

      • [removed] 3 days ago
        [deleted]
    • [removed] 3 days ago
      [deleted]
    • [removed] 3 days ago
      [deleted]
  • marvinborner 3 days ago

    As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun increasingly faster/louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis the traffic went down to before the AI hype started.

    [1] https://types.pl/@marvin/114394404090478296

    • squaresmile 3 days ago

      Same, ClaudeBot makes a stupid amount of requests on my git storage. I just blocked them all on Cloudflare.

  • dilDDoS 4 days ago

    As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.

    • benou 4 days ago

      Yep, AI scrapers have been breaking our open-source project's Gerrit instance hosted at the Linux Network Foundation.

      Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.

      • johnnyanmac 4 days ago

        >Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me.

        a mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.

  • Philpax 4 days ago

    Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163

    • zahlman 4 days ago

      Why not just actually rate-limit everyone, instead of slowing them down with proof-of-work?

      • NobodyNada 4 days ago

        My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.

        It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
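
        A minimal sketch of that binding (not Anubis's actual implementation): the pass is just a signed statement about the IP it was issued to, so it stops verifying the moment the scraper rotates to a new address and the PoW has to be paid again.

          # Sketch of an IP-bound pass: the token only verifies for the IP it
          # was issued to, so rotating IPs means solving a fresh challenge.
          import hmac, hashlib, time

          SECRET = b"server-side-secret"

          def issue_token(client_ip: str, ttl: int = 7 * 24 * 3600) -> str:
              expiry = str(int(time.time()) + ttl)
              sig = hmac.new(SECRET, f"{client_ip}|{expiry}".encode(), hashlib.sha256).hexdigest()
              return f"{expiry}.{sig}"

          def verify_token(token: str, client_ip: str) -> bool:
              try:
                  expiry, sig = token.split(".")
              except ValueError:
                  return False
              expected = hmac.new(SECRET, f"{client_ip}|{expiry}".encode(), hashlib.sha256).hexdigest()
              return hmac.compare_digest(sig, expected) and int(expiry) > time.time()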

    • immibis 4 days ago

      Why haven't they been sued and jailed for DDoS, which is a felony?

      • ranger_danger 4 days ago

        Criminal convictions in the US require a standard of proof that is "beyond a reasonable doubt" and I suspect cases like this would not pass the required mens rea test, as, in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service... and trying to argue otherwise based on any technical reasoning (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court... especially considering web scraping has already been ruled legal, and that a ToS clause against that cannot be legally enforced.

      • Symbiote 3 days ago

        Many are using botnets, so it's not practical to find out who they are.

  • blibble 4 days ago

    they seem to be written by idiots and/or people that don't give a shit about being good internet citizens

    either way the result is the same: they induce massive load

    well-written crawlers will (sketch below):

      - not hit a specific ip/host more frequently than say 1 req/5s
      - put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
      - limit crawling depth based on crawled page quality and/or response time
      - respect robots.txt
      - make it easy to block them
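
    A rough sketch of the politeness rules above (hypothetical crawler, not any real one): one shared breadth-first queue, robots.txt checked per host, and at most one request per host every 5 seconds.

      # Hypothetical polite crawler: global BFS queue + per-host 1 req / 5 s.
      import time, collections, urllib.robotparser
      from urllib.parse import urlparse

      queue = collections.deque(["https://example.org/"])  # shared BFS queue
      last_hit = {}                                         # host -> last request time
      robots = {}                                           # host -> parsed robots.txt

      def allowed(url, agent="politebot"):
          host = urlparse(url).netloc
          if host not in robots:
              rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
              rp.read()
              robots[host] = rp
          return robots[host].can_fetch(agent, url)

      def next_url():
          for _ in range(len(queue)):
              url = queue.popleft()
              host = urlparse(url).netloc
              if time.time() - last_hit.get(host, 0) < 5:   # too soon for this host
                  queue.append(url)                         # push to the back of the queue
                  continue
              if allowed(url):                              # respect robots.txt
                  last_hit[host] = time.time()
                  return url
          return None
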
    • Aachen 3 days ago

      - wait 2 seconds for a page to load before aborting the connection

      - wait for the previous request to finish before requesting the next page, since that would only induce more load, get even slower, and eventually take everything down

      I've designed my site to hold up to traffic spikes anyway and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I don't block much of them. Can't vet every visitor so they'll get the content anyway, whether I like it or not. But if I see a bot having HTTP 499 "client went away before page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I haven't had to do that before, in a decade of hosting my own various tools and websites

    • [removed] 4 days ago
      [deleted]
userbinator 3 days ago

As I've been saying for a while now - if you want to filter for only humans, ask questions only a human can easily answer; counting the number of letters in a word seems to be a good way to filter out LLMs, for example. Yes, that can be relatively easily gotten around, just like Anubis, but with the benefit that it doesn't filter out humans and has absolutely minimal system requirements (a browser that can submit HTML forms), possibly even less than the site itself.

There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)
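
A minimal sketch of that kind of forms-only check (hypothetical questions, not any real forum's code): the expected answer travels as an HMAC in a hidden field, so it needs no JS, no cookies and no server-side state. A real deployment would also add a nonce/expiry so a known question/answer pair can't simply be replayed.

  # Forms-only human check: question plus an HMAC of the expected answer.
  import hmac, hashlib, random

  SECRET = b"server-secret"
  QUESTIONS = [  # swap in domain-specific questions for a niche forum
      ("How many letters are in the word 'strawberry'?", "10"),
      ("How many letters are in the word 'anubis'?", "6"),
  ]

  def mac(answer: str) -> str:
      return hmac.new(SECRET, answer.strip().lower().encode(), hashlib.sha256).hexdigest()

  def render_challenge() -> str:
      question, answer = random.choice(QUESTIONS)
      return (f'<form method="post">{question} <input name="answer">'
              f'<input type="hidden" name="check" value="{mac(answer)}">'
              f'<button>Submit</button></form>')

  def verify(form: dict) -> bool:
      return hmac.compare_digest(mac(form.get("answer", "")), form.get("check", ""))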

  • ack_complete 3 days ago

    For smaller forums, any customization to the new account process will work. When I ran a forum that was getting a frustratingly high amount of spammer signups, I modified the login flow to ask the user to add 1 to the 6-digit number in the stock CAPTCHA. Spam signups dropped like a rock.

  • never_inline 3 days ago

    > counting the number of letters in a word seems to be a good way to filter out LLMs

    As long as this challenge remains obscure enough to not be worth implementing special handlers in the crawler, this sounds like a neat idea.

    But I think if everyone starts doing this particular challenge (char count), the crawlers will start instructing a cheap LLM to do appropriate tool calls and get around it. So the challenge needs to be obscure.

    I wonder if anyone tried building a crawler-firewall or even nginx script which will let the site admin plug their own challenge generator in lua or something, which would then create a minimum HTML form. Maybe even vibe code it :)

  • soared 3 days ago

    Tried and true method! An old video game forum named moparscape used to ask what mopar was and I always had to google it

    • Aachen 3 days ago

      Good thing modern bots can't do a web search!

      • userbinator 2 days ago

        They will be as likely if not more so to fall victim to the large amount of misinformation... and AI-generated crap you'll find from doing so.

  • cm2012 3 days ago

    There is a decent segment of the population that will have a hard time with that.

    • wavemode 3 days ago

      So it's no different from real CAPTCHAs, then.

hansjorg 4 days ago

If you want a tip my friend, just block all of Huawei Cloud by ASN.

  • wging 3 days ago

    ... looks like they did: https://github.com/TecharoHQ/anubis/pull/1004, timestamped a few hours after your comment.

    • scratchyone 3 days ago

      lmfao so that kinda defeats the entire point of this project if they have to resort to a manual IP blocklist anyways

      • BLKNSLVR 3 days ago

        I would actually say that it's been successful in identifying at least one large-scale abuser so far, which can then be blocked via more traditional methods.

        I have my own project that finds malicious traffic IP addresses, and through searching through the results, it's allowed me to identify IP address ranges to be blocked completely.

        Yielding useful information may not have been what it was designed to do, but it's still a useful outcome. Funny thing about Anubis' viral popularity is that it was designed to just protect the author's personal site from a vast army of resource-sucking marauders, and grew because it was open sourced and a LOT of other people found it useful and effective.

        • sandywaffles 2 days ago

          I think that was already common knowledge as hansjorg above suggests

iefbr14 4 days ago

I wouldn't be surprised if just delaying the server response by some 3 seconds will have the same effect on those scrapers as Anubis claims.

  • kingstnap 4 days ago

    Wasting 3 seconds of a computer's time costs literally nothing, while wasting 3 seconds of a person's time is expensive.

    That is literally an anti-human filter.

    • Imustaskforhelp 4 days ago

      From tjhorner on this same thread

      "Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project."

      So it's meant/preferred to block low-effort crawlers, which can still cause damage if you don't deal with them. A 3 second deterrent seems good in that regard. Maybe the 3 second deterrent can come as rate limiting an IP? But they might use swaths of IPs :/

      • OkayPhysicist 4 days ago

        Anubis exists specifically to handle the problem of bots dodging IP rate limiting. The challenge is tied to your IP, so if you're cycling IPs with every request, you pay dramatically more PoW than someone using a single IP. It's intended to be used in depth with IP rate limiting.

    • loeg 3 days ago

      Anubis easily wastes 3 seconds of a human's time already.

  • ranger_danger 4 days ago

    Yea I'm not convinced unless somehow the vast majority of scrapers aren't already using headless browsers (which I assume they are). I feel like all this does is warm the planet.

jmclnx 4 days ago

>The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans

Not for me, I have nothing but a hard time solving CAPTCHAs; about 50% of the time I give up after 2 tries.

  • serf 4 days ago

    it's still certainly trivial for you compared to mentally computing a SHA256 op.

anarki8 3 days ago

Article might be a bit shallow, or maybe my understanding of how Anubis works is incorrect?

1. Anubis makes you calculate a challenge.

2. You get a "token" that you can use for a week to access the website.

3. (I don't see this being considered in the article) "token" that is used too much is rate limited. Calculating a new token for each request is expensive.

  • Aachen 3 days ago

    That, but apparently also restrictions on what tech you can use to access the website:

    - https://news.ycombinator.com/item?id=44971990 person being blocked with `message looking something like "you failed"`

    - https://news.ycombinator.com/item?id=44970290 mentions of other requirements that are allegedly on purpose to block older clients (as browser emulators presumably often would appear to be, because why would they bother implementing newer mechanisms when the web has backwards compatibility)

  • jeroenhd 3 days ago

    That's the basic principle. It's a tool to fight crawlers that spam requests without cookies to evade rate limiting.

    The Chinese crawlers seem to have adjusted their crawling techniques to give their browsers enough compute to pass standard Anubis checks.

SnuffBox 2 days ago

Whenever I see an otherwise civil and mature project utilize something outwardly childish like this I audibly groan and close the page.

I'm sure the software behind it is fine but the imagery and style of it (and the confidence to feature it) makes me doubt the mental credibility/social maturity of anybody willing to make it the first thing you see when accessing a webpage.

Edit: From a quick check of the "CEO" of the company, I was unsurprised to have my concerns confirmed. I may be behind the times, but I think there are far too many people who act obnoxiously (as part of what can only be described as a new subculture) in open source software today, and I wish there were better terms to describe it.

listic 4 days ago

So... Is Anubis actually blocking bots because they didn't bother to circumvent it?

  • loloquwowndueo 3 days ago

    Basically. Anubis is meant to block mindless, careless, rude bots with seemingly no technically proficient human behind the process; these bots tend to be very aggressive and make tons of requests bringing sites down.

    The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.

    Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.

    • elcritch 3 days ago

      Essentially the PoW aspect is pointless then? They could require almost any arbitrary thing.

      • loloquwowndueo 3 days ago

        What else do you envision being used instead of proof of work?

walthamstow 3 days ago

> I host this blog on a single core 128MB VPS

Where does one even find a VPS with such small memory today?

  • tambourine_man 3 days ago

    Or software to run on it. I'm intrigued about this claim as well.

    • Aachen 3 days ago

      The software is easy. Apt install debian apache2 php certbot and you're pretty much set to deploy content to /var/www. I'm sure any BSD variant is also fine, or lots of other software distributions that don't require a graphical environment

      On an old laptop running Windows XP (yes, with GUI, breaking my own rule there) I've also run a lot of services, iirc on 256MB RAM. XP needed about 70 I think, or 52 if I killed stuff like Explorer and unnecessary services, and the remainder was sufficient to run a uTorrent server, XAMPP (Apache, MySQL, Perl and PHP) stack, Filezilla FTP server, OpenArena game server, LogMeIn for management, some network traffic monitoring tool, and probably more things I'm forgetting. This ran probably until like 2014 and I'm pretty sure the site has been on the HN homepage with a blog post about IPv6. The only thing that I wanted to run but couldn't was a Minecraft server that a friend had requested. You can do a heck of a lot with a hundred megabytes of free RAM but not run most Javaware :)

xphos 4 days ago

Yeah, the PoW is minor for botters but annoying for people. I think the only positive is if enough people see anime girls on their screens there might actually be political pressure to make laws against rampant bot crawling

  • Havoc 3 days ago

    > PoW is minor for botters

    But still enough to prevent a billion request DDoS

    These sites have been search-engine scraped forever. It's not about blocking bots entirely, just about this new wave of "fuck you, I don't care if your host goes down" quasi-malicious scrapers

    • st3fan 3 days ago

      "But still enough to prevent a billion request DDoS" - don't you just do the PoW once to get a cookie and then you can browse freely?

      • seba_dos1 3 days ago

        Yes, but a single bot is not a concern. It's the first "D" in DDoS that makes it hard to handle

        (and these bots tend to be very, very dumb - which often happens to make them more effective at DDoSing the server, as they're taking the worst and the most expensive ways to scrape content that's openly available more efficiently elsewhere)

    • elcritch 3 days ago

      Reading TFA, those billion requests would cost web crawlers what, about $100 in compute?

serf 4 days ago

I don't care that they use anime catgirls.

What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.

I hate Amazon's failure pets, I hate Google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.

It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.

  • SnuffBox 2 days ago

    This is something I've always felt about design in general. You should never make it so that a symbol for an inconvenience appears happy or smug, it's a great way to turn people off your product or webpage.

    Reddit implemented something a while back that says "You've been blocked by network security!" with a big smiling Reddit snoo front and centre on the page and every time I bump into it I can't help but think this.

  • heeton 3 days ago

    > What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net

    This is probably intentional. They offer a paid unbranded version. If they had a corporate-friendly brand on the free offering, then there would be fewer people paying for the unbranded one.

  • xandrius 4 days ago

    The original versions were a way to make even a boring event such as a 404 fun. If the page stops conveying the type of error to the user then it's just bad UX, but vomiting all the internal jargon at a non-tech user is also bad UX.

    So, I don't see an error code + something fun to be that bad.

    People love dreaming of the 90s wild web and hate the clean-cut soulless corp web of today, so I don't see how having fun error pages is such an issue?

    • Hizonner 4 days ago

      This assumes it's fun.

      Usually when I hit an error page, and especially if I hit repeated errors, I'm not in the mood for fun, and I'm definitely not in the mood for "fun" provided by the people who probably screwed up to begin with. It comes off as "oops, we can't do anything useful, but maybe if we try to act cute you'll forget that".

      Also, it was more fun the first time or two. There's not a lot of original fun on the error pages you get nowadays.

      > People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today

      It's been a while, but I don't remember much gratuitous cutesiness on the 90s Web. Not unless you were actively looking for it.

      • doublerabbit 4 days ago

        > This assumes it's fun.

        Not to those who don't exist in such cultures. It's creepy, childish, strange to them. It's not something they see in everyday life, nor would I really want to. There is a reason why cartoons are aimed at younger audiences.

        Besides if your webserver is throwing errors, you've configured it incorrectly. Those pages should be branded as the site design with a neat and polite description to what the error is.

  • JdeBP 4 days ago

    Guru Meditations and Sad Macs are not your thing?

    • krige 3 days ago

      FWIW second and third iteration of AmigaOS didn't have "Guru Meditation"; instead it bluntly labeled the numbers as error and task.

    • Hizonner 3 days ago

      That also got old when you got it again and again while you were trying to actually do something. But there wasn't the space to fit quite as much twee on the screen...

thayne 3 days ago

I can't find any documentation that says Anubis does this, (although it seems odd to me that it wouldn't, and I'd love a reference) but it could do the following:

1. Store the nonce (or some other identifier) of each jwt it passes out in the data store

2. Track the number or rate of requests from each token in the data store

3. If a token exceeds the rate limit threshold, revoke the token (or do some other action, like tarpit requests with that token, or throttle the requests)

Then if a bot solves the challenge it can only continue making requests with the token if it is well behaved and doesn't make requests too quickly.

It could also do things like limit how many tokens can be given out to a single ip address at a time to prevent a single server from generating a bunch of tokens.
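
A sketch of steps 1-3 (to be clear, this is the proposed scheme, not documented Anubis behaviour): per-token counters in a shared store, with revocation once a token blows past the limit.

  # Track requests per issued token and revoke tokens that exceed the limit.
  import time

  WINDOW = 60      # seconds
  LIMIT = 120      # requests allowed per token per window
  store = {}       # nonce -> (window_start, count); revoked nonces map to None

  def check_request(nonce: str) -> bool:
      entry = store.get(nonce, (time.time(), 0))
      if entry is None:
          return False                   # token was revoked earlier
      start, count = entry
      if time.time() - start > WINDOW:
          start, count = time.time(), 0  # new rate-limit window
      count += 1
      if count > LIMIT:
          store[nonce] = None            # revoke (or tarpit/throttle instead)
          return False
      store[nonce] = (start, count)
      return True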

8cvor6j844qw_d6 2 days ago

On my daily browser with V8 JIT disabled, Cloudflare Turnstile has the worst performance hit, and often requires an additional click to clear.

Anubis usually clears with no clicks and no noticeable slowdown, even with JIT off. Among the common CAPTCHA solutions it's the least annoying for me.

ajsnigrutin 3 days ago

I always wondered about these anti bot precautions... as a firefox user, with ad blocking and 3rd party cookies disabled, i get the goddamn captcha or other random check (like this) on a bunch of pages now, every time i visit them...

Is it worth it? Millions of users wasting cpu and power for what? Saving a few cents on hosting? Just rate limit requests per second per IP and be done.

Sooner or later bots will be better at captchas than humans, what then? What's so bad with bots reading your blog? When bots evolve, what then? UK style, scan your ID card before you can visit?

The internet became a pain to use... back in the day, you opened the website and saw the content. Now you open it, get an anti-bot check, click, forward to the actual site, a cookie prompt, multiple clicks, then a headline + ads, scroll down a millimeter... do you want to subscribe to a newsletter? Why, I didn't even read the first sentence of the article yet... scroll down... chat with AI bot popup... a bit further down, log in here to see the full article...

Most of the modern web is unusable. I know I'm ranting, but this is just one of the pieces of a puzzle that makes basic browsing a pain these days.

tortillasauce 3 days ago

Anubis works because AI crawlers make very few requests from each IP address to bypass rate-limiting. Last year they could still be blocked by IP range, but now the requests come from so many different networks that that doesn't work anymore.

Doing the proof-of-work for every request is apparently too much work for them.

Crawlers using a single ip, or multiple ips from a single range are easily identifiable and rate-limited.

account42 3 days ago

Good on you for finding a solution for yourself, but personally I will just not use websites that pull this and not contribute to projects where using such a website is required. If you respect me so little that you will make demands about how I use my computer and block me as a bot if I don't comply, then I am going to assume that you're not worth my time.

  • anarki8 3 days ago

    This sounds a bit overdramatic for less than a second of waiting time per week for each device. Unless you employ an army of crawlers, of course.

  • russelg 3 days ago

    Interesting take to say the Linux Kernel is not worth your time.

    • account42 3 days ago

      As far as I know Linux kernel contributions still use email.

extraduder_ire 3 days ago

With the asymmetry of doing the PoW in javascript versus compiled c code, I wonder if this type of rate limiting is ever going to be directly implemented into regular web browsers. (I assume there's already plugins for curl/wget)

Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.

  • Aachen 3 days ago

    Apple supports people that want to not use their software as the gods at Apple intended it? What parallel universe Version of Apple is this!

    Seriously though, does anything of Apple's work without JS, like Icloud or Find my phone? Or does Safari somehow support it in a way that other browsers don't?

    • extraduder_ire 3 days ago

      Last I checked, safari still had a toggle to disable javascript long after both chrome and firefox removed theirs. That's what I was referring to.

jonathanyc 3 days ago

> The idea of “weighing souls” reminded me of another anti-spam solution from the 90s… believe it or not, there was once a company that used poetry to block spam!

> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.

Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?

IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.

Philpax 4 days ago

The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.

I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.

  • davidclark 4 days ago

    The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?

    • Philpax 4 days ago

      The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce the amount of overall volume by limiting how many independent requests can be made at any given time.

      That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.

      • yborg 4 days ago

        > do you really need to be rescraping every website constantly?

        Yes, because if you believe you out-resource your competition, by doing this you deny them training material.

    • hooverd 4 days ago

      The problem with crawlers is that they're functionally indistinguishable from your average malware botnet in behavior. If you saw a bunch of traffic from residential IPs using the same token that's a big tell.

Borg3 4 days ago

Oh, its time to bring Internet back to humans. Maybe its time to treat first layer of Internet just as transport. Then, layer large VPN networks and put services there. People will just VPN to vISP to reach content. Different networks, different interests :) But this time dont fuck up abuse handling. Someone is doing something fishy? Depeer him from network (or his un-cooperating upstream!).

galaxyLogic 3 days ago

I think the solution to captcha-rot is micro-payments. It does consume resources to serve a web page, so who's gonna pay for that?

If you want to do advertisement then don't require a payment, and be happy that crawlers will spread your ad to the users of AI-bots.

If you are a non-profit-site then it's great to get a micro-payment to help you maintain and run the site.

pembrook 3 days ago

Something feels bizarrely incongruent about the people using Anubis. These people used to be the most vehemently pro-piracy, pro internet freedom and information accessibility, etc.

Yet now when it's AI accessing their own content, suddenly they become the DMCA and want to put up walls everywhere.

I'm not part of the AI doomer cult like many here, but it would seem to me that if you publish your content publicly, typically the point is that it would be publicly available and accessible to the world...or am I crazy?

As everything moves to AI-first, this just means nobody will ever find your content and it will not be part of the collective human knowledge. At which point, what's the point of publishing it.

  • SnuffBox 2 days ago

    It is rather funny. "We must prevent AI accessing the Arch Linux help files or it will start the singularity and kill us all!"

  • GreenWatermelon a day ago

    In case you're genuinely confused, the reason for Anubis and similar tools is that AI-training-data-scraping crawlers are assholes, and strangle the living shit out of any webserver they touch, like a cloud of starving locusts descending upon a wheat field.

    i.e. it's DDoS protection.

Aachen 3 days ago

> an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources

Sure, if you ignore that humans click on one page and the problematic scrapers (not the normal search engine volume, but the level we see nowadays where misconfigured crawlers go insane on your site) are requesting many thousands to millions of times more pages per minute. So they'll need many many times the compute to continue hammering your site whereas a normal user can muster to load that one page from the search results that they were interested in

[removed] 3 days ago
[deleted]
lxgr 4 days ago

> This isn’t perfect of course, we can debate the accessibility tradeoffs and weaknesses, but conceptually the idea makes some sense.

It was arguably never a great idea to begin with, and stopped making sense entirely with the advent of generative AI.

zoobab 3 days ago

Time to switch to stagit. Unfortunately it does not generate static pages for a git repo except "master". I am sure someone will modify to support branches.

heap_perms 3 days ago

> I host this blog on a single core 128MB VPS

No wonder the site is being hugged to death. 128MB is not a lot. Maybe it's worth upgrading if you post to Hacker News. Just a thought.

  • bawolff 3 days ago

    It doesn't take much to host a static website. It's all the dynamic stuff/frameworks/db/etc that bogs everything down.

    • tambourine_man 3 days ago

      Still, 128MB is not enough to even run Debian let alone Apache/NGINX. I’m on my phone, but it doesn’t seem like the author is using Cloudflare or another CDN. I’d like to know what they are doing.

      • ronsor 3 days ago

        128MB is more than enough to run Debian and serve a static site. I had no issue with doing it a decade ago and it still works fine.

        How much memory do you think it actually takes to accept a TLS connection and copy files from disk to a socket?

  • Aachen 3 days ago

    Moving bytes around doesn't take RAM but CPU. Notice how switches don't advertise how many gigabytes of RAM they have, but can push a few gigabits of content around between all 24 ports at once without even being expensive

    Also, the HN homepage is pretty tame so long as you don't run WordPress. You don't get more than a few requests per second, so multiply that with the page size (images etc.) and you probably get a few megabits as bandwidth, no problem even for a Raspberry Pi 1 if the sdcard can read fast enough or the files are mapped to RAM by the kernel

  • [removed] 3 days ago
    [deleted]
trostaft 3 days ago

I actually really liked seeing the mascot. Brought a sense of whimsy to the Internet that I've missed for a long time.

ksymph 4 days ago

Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.

That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.

[0] https://xeiaso.net/blog/2025/anubis/

  • jhanschoo 4 days ago

    Your link explicitly says:

    > It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.

    It's meant to rate-limit accesses by requiring client-side compute light enough for legitimate human users and responsible crawlers in order to access but taxing enough to cost indiscriminate crawlers that request host resources excessively.

    It indeed mentions that lighter crawlers do not implement the right functionality in order to execute the JS, but that's not the main reason why it is thought to be sensible. It's a challenge saying that you need to want the content bad enough to spend the amount of compute an individual typically has on hand in order to get me to do the work to serve you.

    • ksymph 4 days ago

      Here's a more relevant quote from the link:

      > Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.

      As the article notes, the work required is negligible, and as the linked post notes, that's by design. Wasting scraper compute is part of the picture to be sure, but not really its primary utility.

      • kevincox 4 days ago

        Why require proof of work with difficulty at all then? Just have no UI other than (javascript) required and run a trivial computation in WASM as a way of testing for modern browser features. That way users don't complain that it is taking 30s on their low-end phone and it doesn't make it any easier for scrapers to scrape (because the PoW was trivial anyways).

    • ranger_danger 4 days ago

      The compute also only seems to happen once, not for every page load, so I'm not sure how this is a huge barrier.

      • untilted 4 days ago

        Once per ip. Presumably there's ip-based rate limiting implemented on top of this, so it's a barrier for scrapers that aggressively rotate ip's to circumvent rate limits.

      • debugnik 4 days ago

        It happens once if the user agent keeps a cookie that can be used for rate limiting. If a crawler hits the limit they need to either wait or throw the cookie away and solve another challenge.

zb3 4 days ago

Anubis doesn't use enough resources to deter AI bots. If you really want to go this way, use React, preferably with more than one UI framework.

  • loloquwowndueo 3 days ago

    Anubis is based on hashcash concepts - just adapted to a web request flow. Basically the same thing - moderately expensive for the sender/requester to compute, insanely cheap for the server/recipient to verify.
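
    A minimal sketch of that asymmetry (parameters made up): the client grinds nonces until the digest has N leading zero bits, roughly 2^N hashes on average, while the server checks the result with a single hash.

      # Hashcash-style PoW: expensive to solve, one hash to verify.
      import hashlib

      def solve(challenge: str, bits: int) -> int:
          nonce = 0
          while True:
              digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
              if int(digest, 16) >> (256 - bits) == 0:  # N leading zero bits
                  return nonce                          # ~2**bits hashes on average
              nonce += 1

      def verify(challenge: str, nonce: int, bits: int) -> bool:
          digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
          return int(digest, 16) >> (256 - bits) == 0   # one hash, microseconds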

  • littlecranky67 3 days ago

    We need bitcoin-based lightning nano-payments for such things. Like visiting the website will cost $0.0001 cent, the lightning invoice is embedded in the header and paid for after single-click confirmation or if threshold is under a pre-configured value. Only way to deal with AI crawlers and future AI scams.

    With the current approach we just waste the energy, if you use bitcoin already mined (=energy previously wasted) it becomes sustainable.

herf 3 days ago

We deployed hashcash for a while back in 2004 to implement Picasa's email relay - at the time it was a pretty good solution because all our clients were kind of similar in capability. Now I think the fastest/slowest device is a broader range (just like Tavis says), so it is harder to tune the difficulty for that.

fluoridation 4 days ago

Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?

  • jsnell 4 days ago

    No, the economics will never work out for a Proof of Work-based counter-abuse challenge. CPU is just too cheap in comparison to the cost of human latency. An hour of a server CPU costs $0.01. How much is an hour of your time worth?

    That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users. So there's no point in theorizing about an attacker solving the challenges cheaper than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
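
    A back-of-envelope version of that claim (rough figures, not from the article):

      cpu_hour_usd = 0.01        # rented server CPU, per core-hour
      challenge_seconds = 1.0    # generous: a full second of PoW per page
      pages = 1_000_000

      cost = pages * challenge_seconds / 3600 * cpu_hour_usd
      print(f"PoW cost to scrape {pages:,} pages: ${cost:.2f}")  # ~ $2.78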

    • pavon 4 days ago

      But for a scraper to be effective it has to load orders of magnitude more pages than a human browses, so a fixed delay causes a human to take 1.1x as long, but it will slow down scraper by 100x. Requiring 100x more hardware to do the same job is absolutely a significant economic impediment.

      • jsnell 3 days ago

        The entire problem is that proof of work does not increase the cost of scraping by 100x. It does not even increase it by 100%. If you run the numbers, a reasonable estimate is that it increases the cost by maybe 0.1%. It is pure snakeoil.

    • fluoridation 4 days ago

      >An hour of a server CPU costs $0.01. How much is an hour of your time worth?

      That's irrelevant. A human is not going to be solving the challenge by hand, nor is the computer of a legitimate user going to be solving the challenge continuously for one hour. The real question is, does the challenge slow down clients enough that the server does not expend outsized resources serving requests of only a few users?

      >Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users.

      No, I disagree. If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.

      • michaelt 4 days ago

        The problem with proof-of-work is many legitimate users are on battery-powered, 5-year-old smartphones. While the scraping servers are huge, 96-core, quadruple-power-supply beasts.

      • jsnell 4 days ago

        The human needs to wait for their computer to solve the challenge.

        You are trading something dirt-cheap (CPU time) for something incredibly expensive (human latency).

        Case in point:

        > If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.

        No. A human sees a 10x slowdown. A human on a low end phone sees a 50x slowdown.

        And the scraper paid one 1/1000000th of a dollar. (The scraper does not care about latency.)

        That is not an effective deterrent. And there is no difficulty factor for the challenge that will work. Either you are adding too much latency to real users, or passing the challenge is too cheap to deter scrapers.

  • VMG 4 days ago

    crawlers can run JS, and also invest into running the Proof-Of-JS better than you can

    • tjhorner 4 days ago

      Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.

      • scratchyone 3 days ago

        wait but then why bother with this PoW system at all? if they're just trying to block anyone without JS that's way easier and doesn't require slowing things down for end users on old devices.

      • Imustaskforhelp 4 days ago

        reminds me of how Wikipedia literally has all the data available, even in a nice format, just for scrapers (I think) and even THEN, there are some scrapers which still scraped Wikipedia and actually cost Wikipedia so much money that I'm pretty sure they either made an official statement about it or at least disclosed it.

        Even then, man, I feel like so many resources could be saved (both the scrapers' and Wikipedia's) if scrapers had the sense to not scrape Wikipedia and instead follow Wikipedia's rules

    • fluoridation 4 days ago

      If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.

jchw 4 days ago

> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.

A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.

Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.

Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straight-forward cat and mouse game. And, that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. It increases more the more sites that use this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than it does a specific random visitor.

To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.

If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.

In the long term, I think the success of this class of tools will stem from two things:

1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.

2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.

I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.

  • o11c 3 days ago

    > A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.

    ... has phpbb not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
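
    The trick, as a minimal Go sketch (not phpBB's actual code; the page renderer and session store here are made-up stubs):

      package main

      import (
          "fmt"
          "net/http"
      )

      // Hypothetical stand-ins for the real page renderer and session store.
      func servePage(w http.ResponseWriter, r *http.Request) { fmt.Fprintln(w, "hello") }
      func createSessionInDB() string { return "fake-session-id" }

      func handler(w http.ResponseWriter, r *http.Request) {
          if _, err := r.Cookie("session"); err == nil {
              servePage(w, r) // returning visitor with a real session
              return
          }
          if _, err := r.Cookie("probe"); err != nil {
              // First visit, no cookies at all: hand out a throwaway probe cookie
              // and serve the page statelessly. Cookie-less crawlers never get
              // past this branch, so they never cost a database row.
              http.SetCookie(w, &http.Cookie{Name: "probe", Value: "1", Path: "/"})
              servePage(w, r)
              return
          }
          // The probe cookie came back, so this client really handles cookies;
          // only now is it worth creating a persistent session.
          http.SetCookie(w, &http.Cookie{Name: "session", Value: createSessionInDB(), Path: "/", HttpOnly: true})
          servePage(w, r)
      }

      func main() {
          http.HandleFunc("/", handler)
          _ = http.ListenAndServe(":8080", nil)
      }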

    • jchw 3 days ago

      phpBB supports browsers that don't support or accept cookies: if you don't have a cookie, the URL for all links and forms will have the session ID in it. Which would be great, but it seems like these bots are not picking those up either for whatever reason.

  • MikeDVB 3 days ago

    We have been seeing our clients' sites being absolutely *hammered* by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.

    Personally I have no issue with AI bots that properly identify themselves scraping content; if the site operator doesn't want it to happen, they can easily block the offending bot(s).

    We built our own proof-of-work challenge that we enable on client sites/accounts as they come under 'attack' and it has been incredible how effective it is. That said I do think it is only a matter of time before the tactics change and these "malicious" AI bots are adapted to look more human / like real browsers.

    I mean honestly it wouldn't be _that_ hard to enable them to run javascript or to emulate a real/accurate User-Agent. That said they could even run headless versions of the browser engines...

    It's definitely going to be cat-and-mouse.

    The brutally honest truth is that if they throttled themselves so as not to totally crash whatever site they're trying to scrape, we'd probably never have noticed or gone through the trouble of writing our own proof-of-work challenge.

    Unfortunately those writing/maintaining these AI bots that hammer sites to death probably either have no concept of the damage it can do or they don't care.

    • jchw 3 days ago

      > We have been seeing our clients' sites being absolutely hammered by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.

      Yep. I noticed this too.

      > That said they could even run headless versions of the browser engines...

      Yes, exactly. To my knowledge that's what's going on with the latest wave that is passing Anubis.

      That said, it looks like the solution to that particular wave is going to be to just block Huawei cloud IP ranges for now. I guess a lot of these requests are coming from that direction.

      Personally though I think there are still a lot of directions Anubis can go in that might tilt this cat and mouse game a bit more. I have some optimism.

      • MikeDVB 3 days ago

        I haven't seen much if anything getting past our pretty simple proof-of-work challenge but I imagine it's only a matter of time.

        Thankfully, so far, it's still been pretty easy to block them by their user agents as well.

yuumei 4 days ago

> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.

Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.

qwertytyyuu 3 days ago

Isn’t Anubis a dog? So it should be an anime dog/wolf girl rather than a cat girl?

  • Twisol 3 days ago

    Yes, Anubis is a dog-headed or jackal-headed god. I actually can't find anywhere on the Anubis website where they talk about their mascot; they just refer to her neutrally as the "default branding".

    Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.

usbpoet 3 days ago

I don't think I've ever actually seen Anubis once. Always interesting to see what's going on in parts of the internet you aren't frequenting.

  • dominick-cc 3 days ago

    I read hackernews on my phone when I'm bored and I've seen it a lot lately. I don't think I've ever seen it on my desktop.

immibis 4 days ago

The actual answer to how this blocks AI crawlers is that they just don't bother to solve the challenge. Once they do bother solving the challenge, the challenge will presumably be changed to a different one.

est 3 days ago

I hope there's some kind of memory-hungry checker to replace the CPU cost.

A 2GB memory requirement won't stop them, but it will limit the parallelism of the crawlers.
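
Something in the Argon2 family would do it: make each attempt of a nonce search memory-hard, so every parallel challenge pins real RAM on the crawler's side while the server verifies with a single call. A rough Go sketch, with made-up parameters (this is not what Anubis does today):

  package main

  import (
      "crypto/rand"
      "encoding/binary"
      "fmt"

      "golang.org/x/crypto/argon2"
  )

  // Illustrative parameters only; a real deployment would tune these.
  const (
      memKiB   = 64 * 1024 // 64 MiB touched on every attempt
      timeCost = 1
      threads  = 1
  )

  // An attempt "wins" if the Argon2id digest starts with a zero byte
  // (expected ~256 attempts, each forced to touch memKiB of RAM).
  func attempt(challenge []byte, nonce uint64) bool {
      buf := make([]byte, 8)
      binary.BigEndian.PutUint64(buf, nonce)
      digest := argon2.IDKey(buf, challenge, timeCost, memKiB, threads, 32)
      return digest[0] == 0
  }

  // Client side: grind nonces. The per-attempt memory cost caps how many
  // of these a single crawler box can run in parallel.
  func solve(challenge []byte) uint64 {
      for nonce := uint64(0); ; nonce++ {
          if attempt(challenge, nonce) {
              return nonce
          }
      }
  }

  func main() {
      challenge := make([]byte, 16)
      if _, err := rand.Read(challenge); err != nil {
          panic(err)
      }
      nonce := solve(challenge)
      // Server side: a single Argon2id call verifies the whole search.
      fmt.Println("nonce:", nonce, "valid:", attempt(challenge, nonce))
  }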

deevus 3 days ago

This seems like a good place to ask. How do I stop bots from signing up to my email list on my website without hosting a backend?

  • account42 3 days ago

    Depending on your target audience, you could require people signing up to send you an email first.

auggierose 3 days ago

Would it not be more effective to just require payment for accessing your website? Then you don't need to care whether a visitor is a bot or not.

miohtama 3 days ago

The solution is to make premium subscription service for those who do not want to solve CAPTCHAs.

Money is the best proof of humanity.

  • lock1 3 days ago

    Doesn't that line of reasoning imply that companies with multi-billion-dollar war chests are much more "human" than a literal human with student loans?

spiritplumber 3 days ago

For the same reason cats sit on your keyboard: because they can.

0003 3 days ago

Soon any attempt to actually do it would indicate you're a bot.

whatevaa 3 days ago

Site doesn't load, must be hit by AI crawlers.

anotherhue 4 days ago

Surely the difficulty factor scales with the system load?

raffraffraff 4 days ago

HN hug of death

  • mr_toad 4 days ago

    I’m getting a black page. Not sure if it’s an ironic meta commentary, or just my ad blocker.

pluc 3 days ago

Can we talk about the "sexy anime girl" thing? Seems it's popular in geek/nerd/hacker circles and I for one don't get it. Browsing reddit anonymously you're flooded with near-pornographic fan-made renders of these things, I really don't get the appeal. Can someone enlighten me?

  • abustamam 3 days ago

    It's a good question. Anime (like many kinds of media, but especially anime) is known for gratuitous fan service where girls/women of all ages are in revealing clothing for seemingly no reason except to entice viewers.

    The reasoning is that because they aren't real people, it's okay to draw and view images of anime characters, regardless of their age. And because geek/nerd circles tend not to socialize with real women, we get this over-proliferation of anime girls.

    • pluc 3 days ago

      This also was my best guess. A "victimless crime" kind of logic that really really creeps me out.

  • andai 3 days ago

    Probably depends on the person, but this stuff is mostly the cute instinct, same as videos of kittens. "Aww" and "I must protect it."

  • dominick-cc 3 days ago

    2D girls don't nag and I've never had to clear their clogged hair out of my shower drain.

  • SnuffBox 2 days ago

    I'd say it's partially a result of 4chan.

buyucu 2 days ago

We're 1-2 years away from putting the entire internet behind Cloudflare, and Anubis is what upsets you? I really don't get these people. Seeing an anime catgirl for 1-2 seconds won't kill you. It might save the internet though.

The principle behind Anubis is very simple: it forces every visitor to brute force a math problem. This cost is negligible if you're running it on your computer or phone. However, if you are running thousands of crawlers in parallel, the cost adds up. Anubis basically makes it expensive to crawl the internet.
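
The shape of that math problem, as a rough Go sketch (the general scheme only, not Anubis's actual code; the difficulty value here is made up):

  package main

  import (
      "crypto/sha256"
      "encoding/hex"
      "fmt"
      "strconv"
      "strings"
  )

  // Client side: brute-force a nonce until sha256(challenge + nonce) starts
  // with `difficulty` zero hex digits. A one-off solve is trivial on a phone;
  // multiplied across millions of crawled pages it becomes a real bill.
  func solve(challenge string, difficulty int) int {
      prefix := strings.Repeat("0", difficulty)
      for nonce := 0; ; nonce++ {
          sum := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
          if strings.HasPrefix(hex.EncodeToString(sum[:]), prefix) {
              return nonce
          }
      }
  }

  // Server side: verification is a single hash, so nearly all the cost
  // lands on the visitor.
  func verify(challenge string, difficulty, nonce int) bool {
      sum := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
      return strings.HasPrefix(hex.EncodeToString(sum[:]), strings.Repeat("0", difficulty))
  }

  func main() {
      challenge, difficulty := "random-server-issued-string", 4
      nonce := solve(challenge, difficulty)
      fmt.Println("nonce:", nonce, "valid:", verify(challenge, difficulty, nonce))
  }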

It's not perfect, but much much better than putting everything behind Cloudflare.

tonymet 4 days ago

So it's a paywall with good intentions -- and even more accessibility concerns. Thus accelerating enshittification.

Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?

It's convoluted security theater mucking up an already bloated, flimsy, and sluggish internet. It's frustrating enough to have to guess school buses every time I want to get work done; now I have to see pornified kitty waifus.

(openwrt is another community plagued with this crap)

lousken 4 days ago

aren't you happy? at least you get to see a catgirl

verall 3 days ago

It's posts like this that make me really miss the webshit weekly

a-dub 3 days ago

i suppose one nice property is that it is trivially scalable. if the problem gets really bad and the scrapers have llms embedded in them to solve captchas, the difficulty could be cranked up and the lifetime could be cranked down. it would make the user experience pretty crappy (party like it's 1999) but it could keep sites up for unauthenticated users without engaging in some captcha complexity race.
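
something as simple as tying both knobs to load would do it. a sketch with made-up thresholds (not how anubis actually tunes itself):

  package main

  import (
      "fmt"
      "time"
  )

  // Sketch: scale challenge difficulty up and pass lifetime down with load.
  // The thresholds and values are purely illustrative.
  func tune(reqPerSec float64) (difficulty int, lifetime time.Duration) {
      switch {
      case reqPerSec < 50:
          return 3, 7 * 24 * time.Hour // quiet: cheap challenge, long-lived pass
      case reqPerSec < 500:
          return 4, 24 * time.Hour
      default:
          return 5, time.Hour // under siege: pricier challenge, short-lived pass
      }
  }

  func main() {
      for _, load := range []float64{10, 200, 5000} {
          d, l := tune(load)
          fmt.Printf("load=%v req/s -> difficulty=%d, lifetime=%s\n", load, d, l)
      }
  }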

it does have arty political vibes though, the distributed and decentralized open source internet with guardian catgirls vs. late stage capitalism's quixotic quest to eat itself to death trying to build an intellectual and economic robot black hole.

xena 4 days ago

[dead]

  • tptacek 4 days ago

    You needed to have a security contact on your website, or at least in the repo. You did not. You assumed security researchers would instead back out to your Github account's repository list, find the .github repository, and look for a security policy there. That's not a thing!

    I'm really surprised you wrote this.

    • qualeed 4 days ago

      >I'm really surprised you wrote this.

      I agree with the rest of your comment, but this seems like a weird little jab to add on for no particular reason. Am I misinterpreting?

      • tptacek 4 days ago

        No, there's some background context I'm not sharing, but it's not interesting. I didn't mean to be cryptic, but, obviously, I managed to be cryptic. I promise you're not missing anything.

  • withinrafael 4 days ago

    The security policy that didn't exist until a few hours ago?

naikrovek 4 days ago

[flagged]

  • s1mplicissimus 4 days ago

    Isn't using an anime catgirl avatar the exact opposite of "look at meee"?

    • naikrovek 3 days ago

      no. it's someone wanting attention and feeling OK about creating an interstitial page to capture your attention, a page which does not prove you're a human while claiming that it does.

      the entire thing is ridiculous. and only those who see no problem shoving anime catgirls into the face of others will deploy it. maybe that's a lot of people; maybe only I object to this. The reality is that there's no technical reason to deploy it, as called out in the linked blog article, so the only reason to do this is a "look at meee" reason, or to announce that one is a fan of this kind of thing, which is another "look at meee"-style reason.

      Why do I object to things like this? Because once you start doing things like this, doing things for attention, you must continually escalate in order to keep capturing that attention. Ad companies do this, and they don't see a problem in escalation, and they know they have to do it. People quickly learn to ignore ads, so in order to make your page loads count as an advertiser, you must employ means which draw attention to your ads. It's the same with people who no longer draw attention to themselves because they like anime catgirls. Now they must put an interstitial page up to force you to see that they like anime catgirls. We've already established that the interstitial page accomplishes nothing other than showing you the image, so showing the image must be the intent. That is what I object to.

  • AuthAuth 4 days ago

    It's just a mascot; you are projecting way too much.

PaulHoule 4 days ago

[flagged]

  • dathinab 4 days ago

    you're overthinking it

    it's as simple as: having a nice picture there makes the whole thing feel nicer and gives it a bit of personality

    so you put in some picture/art you like

    that's it

    similarly, any site using it can change that picture, but there isn't any fundamental problem with the picture, so most don't care to change it

alt187 3 days ago

[flagged]

  • sethaurus 3 days ago

    Could you expand on what ideology this tool is broadcasting and what virtue is being signalled?

    • haskellshill 3 days ago

      >Please call me (order of preference): They/them or She/her please.

      Take a wild guess

      • sethaurus 3 days ago

        Well that tells me the rough cultural area alt187 might be pointing to, but I could use some more clarity. Are you saying that pronouns/transgender/queerness are the ideology? Or are you showing them as shibboleths of a broader ideological tendency that's prevailing in FOSS?

        • haskellshill 3 days ago

          For lack of a better term one might describe the ideology as "woke"

      • Deestan 3 days ago

        Please show me on the doll where this stranger's personal identity hurt you.

  • fortran77 3 days ago

    Exactly right. Few here get it because everyone here climbs over each other to see who can virtue signal the most.

ge96 4 days ago

Oh I saw this recently on ffmpeg's site, pretty fun

senectus1 3 days ago

the action is great, anubis is a very clever idea, i love it.

I'm not a huge fan of the anime thing, but i can live with it.

valiant55 4 days ago

I really don't understand the hostility towards the mascot. I can't think of a bigger red flag.

  • Borgz 4 days ago

    Funny to say this when the article literally says "nothing wrong with mascots!"

    Out of curiosity, what did you read as hostility?

    • valiant55 4 days ago

      Oh, I totally reacted to the title. The last few times Anubis has been the topic, there have always been comments about the "cringey" mascot, and putting that front and center in the title just made me believe that "anime catgirls" was meant as an insult.

      • Imustaskforhelp 4 days ago

        Honestly I am okay with the anime catgirls, since I just find them funny, but it would still be cool to see Linux-related stuff. Imagine a gif of Mr. Tux the penguin racing in, say, SuperTuxKart on the Linux website.

        sourcehut also uses Anubis, but they have replaced the anime catgirl with their own logo. I think disroot does that as well, though I am not sure.

efilife 4 days ago

This cartoon mascot has absolutely nothing to do with anime

If you disagree, please say why

KolmogorovComp 3 days ago

Why does Anubis not leverage the PoW from its users to do something useful (at best, distributed computing for science; at worst, a cryptocurrency that at least lets the webmasters get back some cash)?

  • johnklos 3 days ago

    People are already complaining. Can you imagine how much fodder this would give people who didn't like the chosen work, or the distribution of any funds that a cryptocurrency would create (which would be pennies, I think, and more work to distribute than would be worth doing)?

rnhmjoj 4 days ago

I don't understand why people resort to this tool instead of simply blocking by UA string or IP address. Are there really so many people running these AI crawlers?

I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
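
For reference, doing the same thing at the application layer is only a handful of lines. A sketch (the prefixes below are documentation placeholders, not the companies' real ranges):

  package main

  import (
      "fmt"
      "net/http"
      "net/netip"
  )

  // Placeholder prefixes; substitute the published ranges of whichever
  // crawlers you actually want to drop.
  var blackholed = []netip.Prefix{
      netip.MustParsePrefix("192.0.2.0/24"),
      netip.MustParsePrefix("198.51.100.0/24"),
  }

  func blocked(remoteAddr string) bool {
      addrPort, err := netip.ParseAddrPort(remoteAddr)
      if err != nil {
          return false
      }
      for _, p := range blackholed {
          if p.Contains(addrPort.Addr()) {
              return true
          }
      }
      return false
  }

  func main() {
      http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
          if blocked(r.RemoteAddr) {
              http.Error(w, "go away", http.StatusForbidden)
              return
          }
          fmt.Fprintln(w, "hello, probably-human visitor")
      })
      _ = http.ListenAndServe(":8080", nil)
  }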

  • mnmalst 4 days ago

    Because that solution simply does not work for everyone. People tried, and the crawlers started using proxies with residential IPs.

  • hooverd 4 days ago

    less savory crawlers use residential proxies and are indistinguishable from malware traffic

  • busterarm 4 days ago

    Lots of companies run these kind of crawlers now as part of their products.

    They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.

    There are lots of companies around that you can buy this type of proxy service from.

  • WesolyKubeczek 4 days ago

    You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.

    • rnhmjoj 4 days ago

      Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least they report themselves correctly.

      [1]: https://pod.geraspora.de/posts/17342163

      • nemothekid 4 days ago

        OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block: why would you implement an Anubis PoW MITM proxy when you could just block on UA?

        I get the sense that many of the bad actors are simply copycats who are poorly building their own LLMs and scraping the entire web without a care in the world.

superkuh 4 days ago

Kernel.org* just has to actually configure Anubis rather than deploying the default broken config: enable the meta-refresh proof of work instead of relying on the bleeding-edge JavaScript application proof of work that only corporate browsers handle.
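
The concept itself is small enough to sketch without any Anubis-specific config. This is a hand-rolled illustration, not Anubis's actual meta-refresh implementation: hand out a signed timestamp in a plain <meta refresh> page and only let the request through once the client has demonstrably waited.

  package main

  import (
      "crypto/hmac"
      "crypto/sha256"
      "encoding/hex"
      "fmt"
      "net/http"
      "strconv"
      "time"
  )

  var secret = []byte("replace-me") // placeholder signing key

  func sign(ts string) string {
      mac := hmac.New(sha256.New, secret)
      mac.Write([]byte(ts))
      return hex.EncodeToString(mac.Sum(nil))
  }

  // First visit: plain HTML that refreshes itself after a delay, carrying a
  // signed timestamp. No JavaScript required.
  func challenge(w http.ResponseWriter) {
      ts := strconv.FormatInt(time.Now().Unix(), 10)
      url := "/?ts=" + ts + "&sig=" + sign(ts)
      fmt.Fprintf(w, `<html><head><meta http-equiv="refresh" content="5;url=%s"></head>`+
          `<body>Checking your browser...</body></html>`, url)
  }

  func handler(w http.ResponseWriter, r *http.Request) {
      ts := r.URL.Query().Get("ts")
      sig := r.URL.Query().Get("sig")
      issued, err := strconv.ParseInt(ts, 10, 64)
      waited := err == nil && time.Since(time.Unix(issued, 0)) >= 5*time.Second
      // A real version would also expire old tokens and limit reuse.
      if !waited || !hmac.Equal([]byte(sig), []byte(sign(ts))) {
          challenge(w)
          return
      }
      fmt.Fprintln(w, "welcome through")
  }

  func main() {
      http.HandleFunc("/", handler)
      _ = http.ListenAndServe(":8080", nil)
  }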

* Or whatever site the author is talking about; his site is currently inaccessible due to the number of people trying to load it.

zaptrem 3 days ago

If people are truly concerned about crawlers hammering their 128MB Raspberry Pi website, then a better solution would be to provide an alternative way for scrapers to access the data (e.g., voluntarily contribute a copy of their public site to something like Common Crawl).

If Anubis blocked crawler requests but helpfully redirected them to a giant tarball of every site using the service (with deltas or something to reduce bandwidth), I bet nobody would bother spending the time to automate cracking it, since that's basically negative value. You could even make it a torrent so most of the bandwidth costs are paid by random large labs/universities.
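
Something like this middleware is about all it would take; a sketch of the idea, not anything Anubis ships (the UA substrings and dump URL are made up):

  package main

  import (
      "fmt"
      "net/http"
      "strings"
  )

  // Sketch: before (or instead of) a proof-of-work wall, point anything that
  // self-identifies as a crawler at a cheap bulk dump of the site.
  func crawlerFriendly(next http.Handler) http.Handler {
      return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          ua := strings.ToLower(r.UserAgent())
          for _, marker := range []string{"bot", "crawler", "spider"} {
              if strings.Contains(ua, marker) {
                  http.Redirect(w, r, "https://example.org/dumps/site-latest.tar.gz", http.StatusFound)
                  return
              }
          }
          next.ServeHTTP(w, r)
      })
  }

  func main() {
      site := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
          fmt.Fprintln(w, "normal page for interactive visitors")
      })
      _ = http.ListenAndServe(":8080", crawlerFriendly(site))
  }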

I think the real reason most are so obsessed with blocking crawlers is they want “their cut”… an imagined huge check from OpenAI for their fan fiction/technical reports/whatever.

  • sussmannbaka 3 days ago

    No, this doesn’t work. Many of the affected sites have these but they’re ignored. We’re talking about git forges, arguably the most standardised tool in the industry, where instead of just fetching the repository every single history revision of every single file gets recursively hammered to death. The people spending the VC cash to make the internet unusable right now don’t know how to program. They especially don’t give a shit about being respectful. They just hammer all the sites, all the time, forever.

  • msgodel 3 days ago

    I'm generally very pro-robot (every web UA is a robot really IMO) but these scrapers are exceptionally poorly written and abusive.

    Plenty of organizations managed to crawl the web for decades without knocking things over. There's no reason to behave this way.

    It's not clear to me why they've continued to run them like this. It seems so childish and ignorant.

    • zaptrem 3 days ago

      The bad scrapers would get blocked by the wall I mentioned. The ones intelligent enough to break the wall would simply take the easier way out and download the alternative data source.

  • lmm 3 days ago

    The kind of crawlers/scrapers who DDoS a site like this aren't going to bother checking common crawl or tarballs. You vastly overestimate the intelligence and prosociality of what bursty crawler requests tend to look like. (Anyone who is smart or prosocial will set up their crawler to not overwhelm a site with requests in the first place - yet any site with any kind of popularity gets flooded with these requests sooner or later)

    • zaptrem 3 days ago

      If they don’t have the intelligence to go after the more efficient data collection method then they likely won’t have the intelligence or willpower to work around the second part I mentioned (keeping something like Anubis). The only problem is when you put Anubis in the way of determined, intelligent crawlers without giving them a choice that doesn’t involve breaking Anubis.

  • shiomiru 3 days ago

    > I think the real reason most are so obsessed with blocking crawlers is they want “their cut”…

    I find that an unfair view of the situation. Sure, there are examples such as StackOverflow (which is ridiculous enough as they didn't make the content) but the typical use case I've seen on the small scale is "I want to self-host my git repos because M$ has ruined GitHub, but some VC-funded assholes are drowning the server in requests".

    They could just clone the git repo, and then pull every n hours, but it requires specialized code so they won't. Why would they? There's no money in maintaining that. And that's true for any positive measure you may imagine until these companies are fined for destroying the commons.

  • elsjaako 3 days ago

    There's a lot of people that really don't like AI, and simply don't want their data used for it.

    • zaptrem 3 days ago

      While that’s a reasonable opinion to have, it’s a fight they can’t really win. It’s like putting up a poster in a public square then running up to random people and shouting “no, this poster isn’t for you because I don’t like you, no looking!” Except the person they’re blocking is an unstoppable mega corporation that’s not even morally in the wrong imo (except for when they overburden people’s sites, that’s bad ofc)

      • guappa 3 days ago

        The looking is fine; the photographing and selling the photo, less so… and FYI, in Denmark monuments have copyright, so if you photograph and sell the photos you owe fees :)

jayrwren 4 days ago

Literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" is https://lock.cmpxchg8b.com/anubis.html. Maybe Tavis needs more Google-fu. Maybe that includes using DuckDuckGo?

  • Macha 4 days ago

    The top link when you search the title of the article is the article itself?

    I am shocked, shocked I say.