Why are anime catgirls blocking my access to the Linux kernel?
(lock.cmpxchg8b.com)
816 points by taviso 4 days ago
That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all, which is more of a plus.
This, however, forces servers to increase the challenge difficulty, which increases the waiting time for first-time access.
Obviously the developer of Anubis thinks it is bypassing: https://github.com/TecharoHQ/anubis/issues/978
The point is that it will always be cheaper for bot farms to pass the challenge than for regular users.
Why does that matter? The challenge needs to stay expensive enough to slow down bots, but legitimate users won't be solving anywhere near the same number of challenges, and the alternative is the site getting crawled to death, so they can wait once in a while.
Too bad the challenge's result is only a waste of electricity. Maybe they should do like some of those alt-coins and search for prime numbers or something similar instead.
this only holds true if the data to be accessed is less valuable than the computational cost. in this case, that is false, and spending a few dollars to scrape the data is more than worth it.
reducing the problem to a cost issue is bound to be short sighted.
This is not about preventing crawling entirely, it's about finding a way to prevent crawlers from re-crawling everything way too frequently just because crawling is very cheap. Of course it will always be worth it to crawl the Linux Kernel mailing list, but maybe with a high enough cost per crawl the crawlers will learn to be fine with only crawling it once per hour, for example.
my comment is not about preventing crawling, it's stating that with how much revenue AI is bringing in (real or not), the value of crawling repeatedly >>> the cost of running these flimsy coin-mining algorithms.
At the very least captcha tries to make the human-AI distinction, but these algorithms are purely on the side of making it "expensive". if it's just a capital problem, then it's not a problem for the big corps who are the ones incentivized to do it in the first place!
even if human captcha solvers are involved, at the very least it provides society with some jobs (useless as they may be), but these mining algorithms do society no good and waste compute for nothing!
My biggest bitch is that it requires JS and cookies...
Although the long term problem is the business model of servers paying for all network bandwidth.
Actual human users have consumed a minority of total net bandwidth for decades:
https://www.atom.com/blog/internet-statistics/
Part 4 shows bots out-using humans as far back as 1996 8-/
What are "bots"? This needs to include goggleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.
The difference between that and the LLM training data scraping, is that the previous non-human traffic was assumed, by site servers, to increase their human traffic, through search engine ranking, and thus their revenue. However the current training data scraping is likely to have the opposite effect: capturing traffic with LLM summaries, instead of redirecting it to original source sites.
This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.
So far, it's in the same category as the environmental disaster in progress: ownership is refusing to acknowledge the problem and insisting on business as usual.
Rational predictions are that it's not going to end well...
"Although the long term problem is the business model of servers paying for all network bandwidth."
Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.
The AI scrapers are most likely VC-funded, and all they care about is getting as much data as possible, not worrying about the costs.
They are renting machines at scale too, so bandwidth etc. is definitely cheaper for them. Maybe use a provider that doesn't have bandwidth issues (Hetzner?).
But still, the point is that you might be hosting a website on your small server, and a scraper with its beast of a machine fleet can come and effectively DDoS your server looking for data to scrape. Deterring them is what matters, so that the economics finally slide back in our favour again.
Maybe my statement wasn't clear. The point is that the server operators pay for all of the bandwidth of access to their servers.
When this access is beneficial to them, that's OK, when it's detrimental to them, they're paying for their own decline.
The statement isn't really concerned with what if anything the scraper operators are paying, and I don't think that really matters in reaching the conclusion.
> The difference between that and the LLM training data scraping
Is the traffic that people are complaining about really training traffic?
My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
That doesn't seem like enough traffic to be a really big problem.
On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.
Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.
So what's really going on here? Anybody actually know?
The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.
There's some user-directed traffic, but it's a small fraction, in my experience.
It's not random internet people saying it's training. It's Cloudflare, among others.
Search for “A graph of daily requests over time, comparing different categories of AI Crawlers” on this blog: https://blog.cloudflare.com/ai-labyrinth/
In the feed today:
AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders
The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits it's a steady load of lots of traffic going all over, from lots of IPs. And they really like diffs between old page revisions for some reason.
That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.
But if there's a (discoverable) page comparing every revision of a page to every other revision, and a page has N revisions, there are going to be (N^2-N)/2 delta pages (a page with 100 revisions yields 4,950 of them), so could it just be that the majority of the distinct pages your wiki has are deltas?
I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?
The questions just multiply.
It's near-stock mediawiki, so it has a ton of old versions and diffs off the history tab but I'd expect a crawler to be able to handle it.
What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?
If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.
Search engines, at least, are designed to index the content, for the purpose of helping humans find it.
Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.
This is exactly the reason I "just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."
> copyright attribution
You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example if you subtract the vector for "woman" from "man" and add the difference to "king" you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al. and to say they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.
LLMs quite literally work at the level of their source material, that's how training works, that's how RAG works, etc.
There is no proof that LLMs work at the level of "ideas"; if you could prove that, you'd solve a whole lot of incredibly expensive problems that are current bottlenecks for training and inference.
It is a bit ironic that you'd call someone wanting to control and be paid for the thing they themselves created "selfish", while at the same time writing apologia on why it's okay for a trillion dollar private company to steal someone else's work for their own profit.
It isn't some moral imperative that OpenAI gets access to all of humanity's creations so they can turn a profit.
As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun increasingly faster/louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis the traffic went down to before the AI hype started.
Same, ClaudeBot makes a stupid amount of requests on my git storage. I just blocked them all on Cloudflare.
As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.
Yep, AI scrapers have been breaking our open-source project gerrit instance hosted at Linux Network Foundation.
Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.
>Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me.
a mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.
Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163
My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.
It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
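A minimal sketch of that idea (not Anubis's actual implementation; the claim names, TTL, and secret handling here are made up for illustration): issue a signed token bound to the solver's IP once the proof of work is done, and reject it if presented from another address, so rotating IPs forces a fresh solve.

```python
# Sketch only: a PoW token bound to the solver's IP, checked alongside the
# normal per-IP rate limit. Uses PyJWT (pip install pyjwt); all names,
# lifetimes, and claims are illustrative, not Anubis internals.
import time
import jwt  # PyJWT

SECRET = "server-side-secret"

def issue_token(client_ip: str, ttl_seconds: int = 7 * 24 * 3600) -> str:
    """Issued only after the proof-of-work challenge has been solved."""
    claims = {"ip": client_ip, "exp": int(time.time()) + ttl_seconds}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def token_valid(token: str, client_ip: str) -> bool:
    """Reject expired tokens and tokens presented from a different IP."""
    try:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return False
    return claims.get("ip") == client_ip  # rotating IPs forces a new PoW
```

A scraper that rotates IPs to dodge the rate limiter then has to re-solve the challenge for every new address, which is the whole point.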
Why haven't they been sued and jailed for DDoS, which is a felony?
Criminal convictions in the US require a standard of proof that is "beyond a reasonable doubt" and I suspect cases like this would not pass the required mens rea test, as, in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service... and trying to argue otherwise based on any technical reasoning (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court... especially considering web scraping has already been ruled legal, and that a ToS clause against that cannot be legally enforced.
High volume and inorganic traffic patterns. Wikimedia wrote about it here: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...
they seem to be written either by idiots or by people that don't give a shit about being good internet citizens
either way the result is the same: they induce massive load
well written crawlers will:
- not hit a specific ip/host more frequently than say 1 req/5s
- put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
- limit crawling depth based on crawled page quality and/or response time
- respect robots.txt
- make it easy to block them
- wait 2 seconds for a page to load before aborting the connection
- wait for the previous request to finish before requesting the next page, since piling on requests would only induce more load, get even slower, and eventually take everything down (a minimal sketch of such a crawler follows)
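For illustration, here is a minimal sketch of the per-host politeness part of that list (robots.txt plus a fixed delay per host); a real crawler would add a distributed queue, depth limits, caching of robots.txt, and better error handling. The user agent and delay values are assumptions, not anyone's actual crawler.

```python
# Sketch of a polite single-host fetch policy: respect robots.txt and never
# hit the same host more often than once every 5 seconds.
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests  # assumed available

LAST_HIT: dict[str, float] = {}   # host -> timestamp of last request
MIN_DELAY = 5.0                   # seconds between requests to one host
UA = "ExampleBot/1.0 (+https://example.com/bot)"  # easy to identify and block

def allowed(url: str) -> bool:
    host = urlparse(url).netloc
    rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
    rp.read()                       # a real crawler would cache this per host
    return rp.can_fetch(UA, url)

def polite_get(url: str):
    if not allowed(url):
        return None
    host = urlparse(url).netloc
    wait = MIN_DELAY - (time.monotonic() - LAST_HIT.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)            # one request in flight, spaced out
    LAST_HIT[host] = time.monotonic()
    return requests.get(url, headers={"User-Agent": UA}, timeout=30)
```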
I've designed my site to hold up to traffic spikes anyway and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I don't block much of them. Can't vet every visitor so they'll get the content anyway, whether I like it or not. But if I see a bot having HTTP 499 "client went away before page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I haven't had to do that before, in a decade of hosting my own various tools and websites
As I've been saying for a while now - if you want to filter for only humans, ask questions only a human can easily answer; counting the number of letters in a word seems to be a good way to filter out LLMs, for example. Yes, that can be relatively easily gotten around, just like Anubis, but with the benefit that it doesn't filter out humans and has absolutely minimal system requirements (a browser that can submit HTML forms), possibly even less than the site itself.
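As a hedged sketch of that idea (the words and wording are made up; in practice you'd want more variety), the server side is just a question generator and a one-line check, with the front end being a plain HTML form that any browser can submit:

```python
# Sketch: a "count the letters" gate that only needs HTML form support,
# no JavaScript. Question/answer pairs are generated per visit.
import random

WORDS = ["penguin", "keyboard", "mailing", "kernel"]

def make_challenge() -> tuple[str, int]:
    word = random.choice(WORDS)
    letter = random.choice(word)
    question = f'How many times does the letter "{letter}" appear in "{word}"?'
    return question, word.count(letter)

def check_answer(submitted: str, expected: int) -> bool:
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```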
There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)
For smaller forums, any customization to the new account process will work. When I ran a forum that was getting a frustratingly high amount of spammer signups, I modified the login flow to ask the user to add 1 to the 6-digit number in the stock CAPTCHA. Spam signups dropped like a rock.
> counting the number of letters in a word seems to be a good way to filter out LLMs
As long as this challenge remains obscure enough to not be worth implementing special handlers for in the crawler, this sounds like a neat idea.
But I think if everyone starts doing this particular challenge (char count), the crawlers will start instructing a cheap LLM to do appropriate tool calls and get around it. So the challenge needs to be obscure.
I wonder if anyone has tried building a crawler firewall, or even an nginx script, which would let the site admin plug in their own challenge generator in Lua or something, which would then serve a minimal HTML form. Maybe even vibe code it :)
... looks like they did: https://github.com/TecharoHQ/anubis/pull/1004, timestamped a few hours after your comment.
lmfao so that kinda defeats the entire point of this project if they have to resort to a manual IP blocklist anyways
I would actually say that it's been successful in identifying at least one large-scale abuser so far, which can then be blocked via more traditional methods.
I have my own project that finds malicious traffic IP addresses, and through searching through the results, it's allowed me to identify IP address ranges to be blocked completely.
Yielding useful information may not have been what it was designed to do, but it's still a useful outcome. Funny thing about Anubis' viral popularity is that it was designed to just protect the author's personal site from a vast army of resource-sucking marauders, and grew because it was open sourced and a LOT of other people found it useful and effective.
I think that was already common knowledge as hansjorg above suggests
From tjhorner on this same thread
"Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project."
So it's meant/preferred to block low-effort crawlers, which can still cause damage if you don't deal with them. A 3-second deterrent seems good in that regard. Maybe the 3-second deterrent can come as rate limiting an IP? But they might use swaths of IPs :/
Anubis exists specifically to handle the problem of bots dodging IP rate limiting. The challenge is tied to your IP, so if you're cycling IPs with every request, you pay dramatically more PoW than someone using a single IP. It's intended to be used in depth with IP rate limiting.
Yea I'm not convinced unless somehow the vast majority of scrapers aren't already using headless browsers (which I assume they are). I feel like all this does is warm the planet.
Article might be a bit shallow, or maybe my understanding of how Anubis works is incorrect?
1. Anubis makes you calculate a challenge.
2. You get a "token" that you can use for a week to access the website.
3. (I don't see this being considered in the article) "token" that is used too much is rate limited. Calculating a new token for each request is expensive.
That, but apparently also restrictions on what tech you can use to access the website:
- https://news.ycombinator.com/item?id=44971990 person being blocked with `message looking something like "you failed"`
- https://news.ycombinator.com/item?id=44970290 mentions of other requirements that are allegedly on purpose to block older clients (as browser emulators presumably often would appear to be, because why would they bother implementing newer mechanisms when the web has backwards compatibility)
That's the basic principle. It's a tool to fight crawlers that spam requests without cookies to evade rate limiting.
The Chinese crawlers seem to have adjusted their crawling techniques to give their browsers enough compute to pass standard Anubis checks.
Whenever I see an otherwise civil and mature project utilize something outwardly childish like this I audibly groan and close the page.
I'm sure the software behind it is fine but the imagery and style of it (and the confidence to feature it) makes me doubt the mental credibility/social maturity of anybody willing to make it the first thing you see when accessing a webpage.
Edit: From a quick check of the "CEO" of the company, I was unsurprised to have my concerns confirmed. I may be behind the times, but I think there are far too many people who act obnoxiously (as part of what can only be described as a new subculture) in open source software today, and I wish there were better terms to describe it.
So... Is Anubis actually blocking bots because they didn't bother to circumvent it?
Basically. Anubis is meant to block mindless, careless, rude bots with seemingly no technically proficient human behind the process; these bots tend to be very aggressive and make tons of requests bringing sites down.
The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.
Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.
> I host this blog on a single core 128MB VPS
Where does one even find a VPS with such small memory today?
Or software to run on it. I'm intrigued about this claim as well.
The software is easy. Apt install debian apache2 php certbot and you're pretty much set to deploy content to /var/www. I'm sure any BSD variant is also fine, or lots of other software distributions that don't require a graphical environment
On an old laptop running Windows XP (yes, with GUI, breaking my own rule there) I've also run a lot of services, iirc on 256MB RAM. XP needed about 70 I think, or 52 if I killed stuff like Explorer and unnecessary services, and the remainder was sufficient to run a uTorrent server, XAMPP (Apache, MySQL, Perl and PHP) stack, Filezilla FTP server, OpenArena game server, LogMeIn for management, some network traffic monitoring tool, and probably more things I'm forgetting. This ran probably until like 2014 and I'm pretty sure the site has been on the HN homepage with a blog post about IPv6. The only thing that I wanted to run but couldn't was a Minecraft server that a friend had requested. You can do a heck of a lot with a hundred megabytes of free RAM but not run most Javaware :)
What I meant is that I’m not sure it will even boot. Bookworm minimum requirements are 256MB of RAM.
https://www.debian.org/releases/bookworm/armel/ch03s04.en.ht...
128MB should be plenty. I used systems for years with much less. But in reality, Linux is much heavier these days.
> PoW is minor for botters
But still enough to prevent a billion request DDoS
These sites have been search-engine scraped forever. It's not about blocking bots entirely, just about this new wave of fuck-you-I-don't-care-if-your-host-goes-down quasi-malicious scrapers.
Yes, but a single bot is not a concern. It's the first "D" in DDoS that makes it hard to handle
(and these bots tend to be very, very dumb - which often happens to make them more effective at DDoSing the server, as they're taking the worst and the most expensive ways to scrape content that's openly available more efficiently elsewhere)
I don't care that they use anime catgirls.
What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.
I hate Amazon's failure pets, I hate Google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.
It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.
This is something I've always felt about design in general. You should never make it so that a symbol for an inconvenience appears happy or smug, it's a great way to turn people off your product or webpage.
Reddit implemented something a while back that says "You've been blocked by network security!" with a big smiling Reddit snoo front and centre on the page and every time I bump into it I can't help but think this.
> What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net
This is probably intentional. They offer a paid unbranded version. If they had a corporate-friendly brand on the free offering, then there would be fewer people paying for the unbranded one.
The original versions were a way to make even a boring event such as a 404 fun. If the page stops conveying the type of error to the user then it's just bad UX, but vomiting all the internal jargon at a non-tech user is also bad UX.
So, I don't see an error code + something fun to be that bad.
People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today, so I don't see how having fun error pages to be such an issue?
This assumes it's fun.
Usually when I hit an error page, and especially if I hit repeated errors, I'm not in the mood for fun, and I'm definitely not in the mood for "fun" provided by the people who probably screwed up to begin with. It comes off as "oops, we can't do anything useful, but maybe if we try to act cute you'll forget that".
Also, it was more fun the first time or two. There's a not a lot of orginal fun on the error pages you get nowadays.
> People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today
It's been a while, but I don't remember much gratuitous cutesiness on the 90s Web. Not unless you were actively looking for it.
> This assumes it's fun.
Not to those who don't exist in such cultures. It's creepy, childish, strange to them. It's not something they see in everyday life, nor would I really want to. There is a reason why cartoons are aimed for younger audiences.
Besides if your webserver is throwing errors, you've configured it incorrectly. Those pages should be branded as the site design with a neat and polite description to what the error is.
I can't find any documentation that says Anubis does this, (although it seems odd to me that it wouldn't, and I'd love a reference) but it could do the following:
1. Store the nonce (or some other identifier) of each jwt it passes out in the data store
2. Track the number or rate of requests from each token in the data store
3. If a token exceeds the rate limit threshold, revoke the token (or do some other action, like tarpit requests with that token, or throttle the requests)
Then if a bot solves the challenge it can only continue making requests with the token if it is well behaved and doesn't make requests too quickly.
It could also do things like limit how many tokens can be given out to a single ip address at a time to prevent a single server from generating a bunch of tokens.
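A hedged sketch of steps 1-3 above (this is a hypothetical design, not documented Anubis behaviour; the key names, limits, and revocation policy are all made up): keep a counter per token nonce in the data store and revoke the token when it gets too chatty.

```python
# Sketch: per-token rate accounting with revocation, backed by Redis here
# purely for illustration.
import time
import redis  # assumed available

r = redis.Redis()
LIMIT = 60          # max requests per token per window (illustrative)
WINDOW = 60         # window length in seconds

def request_allowed(token_nonce: str) -> bool:
    if r.sismember("revoked", token_nonce):
        return False
    key = f"rate:{token_nonce}:{int(time.time() // WINDOW)}"
    count = r.incr(key)
    r.expire(key, WINDOW * 2)          # let old windows age out
    if count > LIMIT:
        r.sadd("revoked", token_nonce)  # force the client to re-solve the PoW
        return False
    return True
```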
On my daily browser with V8 JIT disabled, Cloudflare Turnstile has the worst performance hit, and often requires an additional click to clear.
Anubis usually clears with no clicks and no noticeable slowdown, even with JIT off. Among the common CAPTCHA solutions it's the least annoying for me.
I always wondered about these anti bot precautions... as a firefox user, with ad blocking and 3rd party cookies disabled, i get the goddamn captcha or other random check (like this) on a bunch of pages now, every time i visit them...
Is it worth it? Millions of users wasting cpu and power for what? Saving a few cents on hosting? Just rate limit requests per second per IP and be done.
Sooner or later bots will be better at captchas than humans, what then? What's so bad with bots reading your blog? When bots evolve, what then? UK style, scan your ID card before you can visit?
The internet became a pain to use... back in the day, you opened the website and saw the content. Now you open it, get an anti-bot check, click, forward to the actual site, a cookie prompt, multiple clicks, then a headline + ads, scroll down a millimeter... do you want to subscribe to a newsletter? Why, I didn't even read the first sentence of the article yet... scroll down... chat with AI bot popup... a bit further down, log in here to see the full article...
Most of the modern web is unusable. I know I'm ranting, but this is just one of the pieces of a puzzle that makes basic browsing a pain these days.
Anubis works because AI crawlers make very few requests from each IP address to bypass rate limiting. Last year they could still be blocked by IP range, but now the requests come from so many different networks that that doesn't work anymore.
Doing the proof-of-work for every request is apparently too much work for them.
Crawlers using a single ip, or multiple ips from a single range are easily identifiable and rate-limited.
Good on you for finding a solution, but personally I will just not use websites that pull this, and I won't contribute to projects where using such a website is required. If you respect me so little that you will make demands about how I use my computer and block me as a bot if I don't comply, then I am going to assume that you're not worth my time.
With the asymmetry of doing the PoW in javascript versus compiled c code, I wonder if this type of rate limiting is ever going to be directly implemented into regular web browsers. (I assume there's already plugins for curl/wget)
Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.
Apple supports people that want to not use their software as the gods at Apple intended it? What parallel universe Version of Apple is this!
Seriously though, does anything of Apple's work without JS, like Icloud or Find my phone? Or does Safari somehow support it in a way that other browsers don't?
Last I checked, safari still had a toggle to disable javascript long after both chrome and firefox removed theirs. That's what I was referring to.
> The idea of “weighing souls” reminded me of another anti-spam solution from the 90s… believe it or not, there was once a company that used poetry to block spam!
> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.
Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?
IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.
The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?
The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce the amount of overall volume by limiting how many independent requests can be made at any given time.
That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.
Oh, it's time to bring the Internet back to humans. Maybe it's time to treat the first layer of the Internet as just transport. Then, layer large VPN networks on top and put services there. People will just VPN to a vISP to reach content. Different networks, different interests :) But this time don't fuck up abuse handling. Someone is doing something fishy? Depeer him from the network (or his uncooperative upstream!).
I think the solution to captcha-rot is micro-payments. It does consume resources to serve a web page, so who's going to pay for that?
If you want to do advertisement then don't require a payment, and be happy that crawlers will spread your ad to the users of AI-bots.
If you are a non-profit-site then it's great to get a micro-payment to help you maintain and run the site.
Something feels bizarrely incongruent about the people using Anubis. These people used to be the most vehemently pro-piracy, pro internet freedom and information accessibility, etc.
Yet now when it's AI accessing their own content, suddenly they become the DMCA and want to put up walls everywhere.
I'm not part of the AI doomer cult like many here, but it would seem to me that if you publish your content publicly, typically the point is that it would be publicly available and accessible to the world...or am I crazy?
As everything moves to AI-first, this just means nobody will ever find your content and it will not be part of the collective human knowledge. At which point, what's the point of publishing it.
In case you're genuinely confused, the reason for Anubis and similar tools is that AI-training-data-scraping crawlers are assholes, and strangle the living shit out of any webserver they touch, like a cloud of starving locusts descending upon a wheat field.
i.e. it's DDoS protection.
> an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources
Sure, if you ignore that humans click on one page and the problematic scrapers (not the normal search engine volume, but the level we see nowadays where misconfigured crawlers go insane on your site) are requesting many thousands to millions of times more pages per minute. So they'll need many many times the compute to continue hammering your site whereas a normal user can muster to load that one page from the search results that they were interested in
> I host this blog on a single core 128MB VPS
No wonder the site is being hugged to death. 128MB is not a lot. Maybe it's worth upgrading if you post to Hacker News. Just a thought.
Still, 128MB is not enough to even run Debian let alone Apache/NGINX. I’m on my phone, but it doesn’t seem like the author is using Cloudflare or another CDN. I’d like to know what they are doing.
Moving bytes around doesn't take RAM but CPU. Notice how switches don't advertise how many gigabytes of RAM they have, yet can push a few gigabits of content around between all 24 ports at once without even being expensive.
Also, the HN homepage is pretty tame so long as you don't run WordPress. You don't get more than a few requests per second, so multiply that with the page size (images etc.) and you probably get a few megabits as bandwidth, no problem even for a Raspberry Pi 1 if the sdcard can read fast enough or the files are mapped to RAM by the kernel
I wrote about something similar a while back: https://maori.geek.nz/proof-of-human-2ee5b9a3fa28
It's about the difficulty of proving you are human, especially when every test built has so much incentive to be broken. I don't think it will be solved, or could ever be solved.
Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
Your link explicitly says:
> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
It's meant to rate-limit accesses by requiring client-side compute light enough for legitimate human users and responsible crawlers in order to access but taxing enough to cost indiscriminate crawlers that request host resources excessively.
It indeed mentions that lighter crawlers do not implement the right functionality in order to execute the JS, but that's not the main reason why it is thought to be sensible. It's a challenge saying that you need to want the content bad enough to spend the amount of compute an individual typically has on hand in order to get me to do the work to serve you.
Here's a more relevant quote from the link:
> Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.
As the article notes, the work required is negligible, and as the linked post notes, that's by design. Wasting scraper compute is part of the picture to be sure, but not really its primary utility.
Why require proof of work with difficulty at all then? Just have no UI other than (javascript) required and run a trivial computation in WASM as a way of testing for modern browser features. That way users don't complain that it is taking 30s on their low-end phone and it doesn't make it any easier for scrapers to scrape (because the PoW was trivial anyways).
The compute also only seems to happen once, not for every page load, so I'm not sure how this is a huge barrier.
Anubis is based on hashcash concepts - just adapted to a web request flow. Basically the same thing - moderately expensive for the sender/requester to compute, insanely cheap for the server/recipient to verify.
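To make that asymmetry concrete, here is a toy hashcash-style round (not Anubis's exact scheme; the difficulty value, encoding, and function names are invented for illustration): the client loops until it finds a nonce whose hash has enough leading zero bits, and the server verifies the result with a single hash.

```python
# Toy hashcash-style proof of work: expensive to solve, one hash to verify.
import hashlib
from itertools import count

DIFFICULTY = 20  # leading zero bits required; illustrative value only

def leading_zero_bits(digest: bytes) -> int:
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()

def solve(challenge: str) -> int:
    for nonce in count():                          # client-side brute force
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce

def verify(challenge: str, nonce: int) -> bool:    # server side: one hash
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY
```

Each extra bit of difficulty roughly doubles the expected client work while verification stays at a single hash.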
We need bitcoin-based lightning nano-payments for such things. Like visiting the website will cost $0.0001 cent, the lightning invoice is embedded in the header and paid for after single-click confirmation or if threshold is under a pre-configured value. Only way to deal with AI crawlers and future AI scams.
With the current approach we just waste the energy, if you use bitcoin already mined (=energy previously wasted) it becomes sustainable.
We deployed hashcash for a while back in 2004 to implement Picasa's email relay - at the time it was a pretty good solution because all our clients were kind of similar in capability. Now I think the fastest/slowest device is a broader range (just like Tavis says), so it is harder to tune the difficulty for that.
Just use Anubis Bypass: https://addons.mozilla.org/en-US/android/addon/anubis-bypass...
Haven't seen dumb anime characters since.
Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?
No, the economics will never work out for a Proof of Work-based counter-abuse challenge. CPU is just too cheap in comparison to the cost of human latency. An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users. So there's no point in theorizing about an attacker solving the challenges cheaper than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
But for a scraper to be effective it has to load orders of magnitude more pages than a human browses, so a fixed delay causes a human to take 1.1x as long, but it will slow down scraper by 100x. Requiring 100x more hardware to do the same job is absolutely a significant economic impediment.
>An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's irrelevant. A human is not going to be solving the challenge by hand, nor is the computer of a legitimate user going to be solving the challenge continuously for one hour. The real question is, does the challenge slow down clients enough that the server does not expend outsized resources serving requests of only a few users?
>Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users.
No, I disagree. If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
The human needs to wait for their computer to solve the challenge.
You are trading something dirt-cheap (CPU time) for something incredibly expensive (human latency).
Case in point:
> If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
No. A human sees a 10x slowdown. A human on a low end phone sees a 50x slowdown.
And the scraper paid one 1/1000000th of a dollar. (The scraper does not care about latency.)
That is not an effective deterrent. And there is no difficulty factor for the challenge that will work. Either you are adding too much latency to real users, or passing the challenge is too cheap to deter scrapers.
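A back-of-the-envelope version of that trade-off, using only the figures quoted in this thread (these are the thread's assumptions, not measurements):

```python
# Rough numbers from the thread: what a 250 ms PoW costs each side.
CPU_HOUR_USD = 0.01          # quoted price of an hour of server CPU
CHALLENGE_SECONDS = 0.25     # challenge time on fast hardware
SLOW_PHONE_FACTOR = 10       # assumed slowdown on a low-end phone

scraper_cost_per_token = CPU_HOUR_USD * CHALLENGE_SECONDS / 3600
human_wait_seconds = CHALLENGE_SECONDS * SLOW_PHONE_FACTOR

print(f"scraper pays ~${scraper_cost_per_token:.9f} per token")   # ~$0.0000007
print(f"a human on a slow phone waits ~{human_wait_seconds:.1f} s per token")
```

On those numbers the scraper's cost per solved challenge is well under a thousandth of a cent, while the human pays the wait in real time, which is the asymmetry being argued here.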
Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
wait but then why bother with this PoW system at all? if they're just trying to block anyone without JS that's way easier and doesn't require slowing things down for end users on old devices.
reminds me of how Wikipedia literally has all its data available, even in a nice format, just for scrapers (I think), and even THEN some scrapers still scraped Wikipedia and cost it so much money that I'm pretty sure some official statement had to be made, or at least it was disclosed publicly.
Even then, man, I feel like you could save so many resources (both yours and Wikipedia's) if scrapers had the sense not to scrape Wikipedia and instead follow Wikipedia's rules.
If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.
And Codeberg, even behind Anubis, is not immune from scrapers either
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.
Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straight-forward cat and mouse game. And, that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. It increases more the more sites that use this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than it does a specific random visitor.
To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.
If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.
In the long term, I think the success of this class of tools will stem from two things:
1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.
2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.
I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.
> A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
... has phpbb not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
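For reference, the trick being alluded to, sketched in Python rather than phpBB's actual PHP (every name here is made up): only create the persistent session once the client has proven it can echo a cookie back, so cookie-less bots never cause a database write.

```python
# Sketch: defer session creation until the client echoes a cookie back.
# SESSIONS stands in for the session table; set_cookie is whatever your
# framework provides. Illustrative only, not phpBB's implementation.
import secrets

SESSIONS: dict[str, dict] = {}

def handle_request(cookies: dict, set_cookie):
    sid = cookies.get("sid")
    if sid is not None:
        # Second visit or later: the client clearly handles cookies, so
        # creating/loading a persistent session is now worthwhile.
        return SESSIONS.setdefault(sid, {"id": sid})
    # First visit: hand out a candidate ID but store nothing server-side.
    set_cookie("sid", secrets.token_hex(16))
    return None   # treat this request as anonymous and stateless
```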
We have been seeing our clients' sites being absolutely *hammered* by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Personally I have no issues with AI bots, that properly identify themselves, from scraping content as if the site operator doesn't want it to happen they can easily block the offending bot(s).
We built our own proof-of-work challenge that we enable on client sites/accounts as they come under 'attack' and it has been incredible how effective it is. That said I do think it is only a matter of time before the tactics change and these "malicious" AI bots are adapted to look more human / like real browsers.
I mean honestly it wouldn't be _that_ hard to enable them to run javascript or to emulate a real/accurate User-Agent. That said they could even run headless versions of the browser engines...
It's definitely going to be cat-and-mouse.
The most brutal honest truth is that if they throttled themselves as not to totally crash whatever site they're trying to scrape we'd probably have never noticed or gone through the trouble of writing our own proof-of-work challenge.
Unfortunately those writing/maintaining these AI bots that hammer sites to death probably either have no concept of the damage it can do or they don't care.
> We have been seeing our clients' sites being absolutely hammered by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Yep. I noticed this too.
> That said they could even run headless versions of the browser engines...
Yes, exactly. To my knowledge that's what's going on with the latest wave that is passing Anubis.
That said, it looks like the solution to that particular wave is going to be to just block Huawei cloud IP ranges for now. I guess a lot of these requests are coming from that direction.
Personally though I think there are still a lot of directions Anubis can go in that might tilt this cat and mouse game a bit more. I have some optimism.
> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
Isn't Anubis a dog? So it should be an anime dog/wolf girl rather than a cat girl?
Yes, Anubis is a dog-headed or jackal-headed god. I actually can't find anywhere on the Anubis website where they talk about their mascot; they just refer to her neutrally as the "default branding".
Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.
I read hackernews on my phone when I'm bored and I've seen it a lot lately. I don't think I've ever seen it on my desktop.
Would it not be more effective just to require payment for accessing your website? Then you don't need to care about bot or not.
For the same reason why cats sit on your keyboard. Because they can
Surely the difficulty factor scales with the system load?
Can we talk about the "sexy anime girl" thing? Seems it's popular in geek/nerd/hacker circles and I for one don't get it. Browsing reddit anonymously you're flooded with near-pornographic fan-made renders of these things, I really don't get the appeal. Can someone enlighten me?
It's a good question. Anime (like many media, but especially anime) is known to have gratuitous fan service where girls/women of all ages are in revealing clothing for seemingly no reason except to just entice viewers.
The reasoning is that because they aren't real people, it's okay to draw and view images of anime characters, regardless of their age. And because geek/nerd circles tend not to socialize with real women, we get this over-proliferation of anime girls.
2D girls don't nag and I've never had to clear their clogged hair out of my shower drain.
We're 1-2 years away from putting the entire internet behind Cloudflare, and Anubis is what upsets you? I really don't get these people. Seeing an anime catgirl for 1-2 seconds won't kill you. It might save the internet though.
The principle behind Anubis is very simple: it forces every visitor to brute force a math problem. This cost is negligible if you're running it on your computer or phone. However, if you are running thousands of crawlers in parallel, the cost adds up. Anubis basically makes it expensive to crawl the internet.
It's not perfect, but much much better than putting everything behind Cloudflare.
So it's a paywall with -- good intentions -- and even more accessibility concerns. Thus accelerating enshittification.
Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?
It's convoluted security theater mucking up an already bloated, flimsy, and sluggish internet. It's frustrating enough to guess schoolbuses every time I want to get work done; now I have to see pornified kitty waifus.
(openwrt is another community plagued with this crap)
here is the community post with Anubis pro / con experiences https://forum.openwrt.org/t/trying-out-anubis-on-the-wiki/23...
i suppose one nice property is that it is trivially scalable. if the problem gets really bad and the scrapers have llms embedded in them to solve captchas, the difficulty could be cranked up and the lifetime could be cranked down. it would make the user experience pretty crappy (party like it's 1999) but it could keep sites up for unauthenticated users without engaging in some captcha complexity race.
it does have arty political vibes though, the distributed and decentralized open source internet with guardian catgirls vs. late stage capitalism's quixotic quest to eat itself to death trying to build an intellectual and economic robot black hole.
You needed to have a security contact on your website, or at least in the repo. You did not. You assumed security researchers would instead back out to your Github account's repository list, find the .github repository, and look for a security policy there. That's not a thing!
I'm really surprised you wrote this.
The security policy that didn't exist until a few hours ago?
Added on March 18: https://github.com/TecharoHQ/.github/commits/main/SECURITY.m...
Copied to the root of the repo after the disclosure
ref: https://github.com/TecharoHQ/anubis/issues/1002#issuecomment...
Isn't using an anime catgirl avatar the exact opposite of "look at meee"?
no. it's someone wanting attention and feeling ok creating an interstitial page to capture your attention which does not prove you're a human while saying that the page proves you're human.
the entire thing is ridiculous. and only those who see no problem shoving anime catgirls into the face of others will deploy it. maybe that's a lot of people; maybe only I object to this. The reality is that there's no technical reason to deploy it, as called out in the linked blog article, so the only reason to do this is a "look at meee" reason, or to announce that one is a fan of this kind of thing, which is another "look at meee"-style reason.
Why do I object to things like this? Because once you start doing things like this, doing things for attention, you must continually escalate in order to keep capturing that attention. Ad companies do this, and they don't see a problem in escalation, and they know they have to do it. People quickly learn to ignore ads, so in order to make your page loads count as an advertiser, you must employ means which draw attention to your ads. It's the same with people who no longer draw attention to themselves because they like anime catgirls. Now they must put an interstitial page up to force you to see that they like anime catgirls. We've already established that the interstitial page accomplishes nothing other than showing you the image, so showing the image must be the intent. That is what I object to.
you are overthinking
it's as simple as having a nice picture there; it makes this whole thing feel nicer and gives it a bit of personality
so you put in some picture/art you like
that's it
similarly, any site using it can change that picture, but there isn't any fundamental problem with the picture, so most don't care to change it
>Please call me (order of preference): They/them or She/her please.
Take a wild guess
Well that tells me the rough cultural area alt187 might be pointing to, but I could use some more clarity. Are you saying that pronouns/transgender/queerness are the ideology? Or are you showing them as shibboleths of a broader ideological tendency that's prevailing in FOSS?
For lack of a better term one might describe the ideology as "woke"
Honestly I am okay with anime catgirls since I just find it funny, but still, it would be cool to see Linux-related stuff. Imagine a gif of Mr. Tux the penguin racing in, like, SuperTuxKart for the Linux website.
SourceHut also uses Anubis, but they have replaced the anime catgirl thing with their own logo. I think Disroot does that too, but I'm not sure.
Why does Anubis not leverage PoW from its users to do something useful (at best, distributed computing for science; at worst, a cryptocurrency at least allowing the webmasters to get back some cash)?
People are already complaining. Could you imagine how much fodder this would give people who didn't like the work, or the distribution of any funds that a cryptocurrency would create (which would be pennies, I think, and more work to distribute than would be worth doing)?
I don't understand, why do people resort to this tool instead of simply blocking by UA string or IP address. Are there so many people running these AI crawlers?
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
Lots of companies run these kind of crawlers now as part of their products.
They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.
There are lots of companies around that you can buy this type of proxy service from.
You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. they are really idiotic in behaviour, but at least report themselves correctly.
OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block - why would you implement an Anubis PoW MITM proxy when you could just block on UA?
I get the sense many of the bad actors are simply poor copycats that are poorly building LLMs and are scraping the entire web without a care in the world
Source: Cloudflare
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
Perplexity's defense is that they're not doing it for training/KB building crawls but for answering dynamic queries calls and this is apparently better.
Kernel.org* just has to actually configure Anubis rather than deploying the default broken config. Enable the meta-refresh proof of work rather than relying on the bleeding-edge, corporate-browsers-only JavaScript application proof of work.
* or whatever site the author is talking about, his site is currently inaccessible due to the amount of people trying to load it.
If people are truly concerned about the crawlers hammering their 128mb raspberry pi website then a better solution would be to provide an alternative way for scrapers to access the data (e.g., voluntarily contribute a copy of their public site to something like common crawl).
If Anubis blocked crawler requests but helpfully redirected to a giant tarball of every site using their service (with deltas or something to reduce bandwidth), I bet nobody would bother actually spending the time to automate cracking it, since it's basically negative value. You could even make it a torrent so most of the bandwidth costs are paid by random large labs/universities.
I think the real reason most are so obsessed with blocking crawlers is they want “their cut”… an imagined huge check from OpenAI for their fan fiction/technical reports/whatever.
No, this doesn’t work. Many of the affected sites have these but they’re ignored. We’re talking about git forges, arguably the most standardised tool in the industry, where instead of just fetching the repository every single history revision of every single file gets recursively hammered to death. The people spending the VC cash to make the internet unusable right now don’t know how to program. They especially don’t give a shit about being respectful. They just hammer all the sites, all the time, forever.
I'm generally very pro-robot (every web UA is a robot really IMO) but these scrapers are exceptionally poorly written and abusive.
Plenty of organizations managed to crawl the web for decades without knocking things over. There's no reason to behave this way.
It's not clear to me why they've continued to run them like this. It seems so childish and ignorant.
The kind of crawlers/scrapers who DDoS a site like this aren't going to bother checking common crawl or tarballs. You vastly overestimate the intelligence and prosociality of what bursty crawler requests tend to look like. (Anyone who is smart or prosocial will set up their crawler to not overwhelm a site with requests in the first place - yet any site with any kind of popularity gets flooded with these requests sooner or later)
If they don’t have the intelligence to go after the more efficient data collection method then they likely won’t have the intelligence or willpower to work around the second part I mentioned (keeping something like Anubis). The only problem is when you put Anubis in the way of determined, intelligent crawlers without giving them a choice that doesn’t involve breaking Anubis.
> I think the real reason most are so obsessed with blocking crawlers is they want “their cut”…
I find that an unfair view of the situation. Sure, there are examples such as StackOverflow (which is ridiculous enough as they didn't make the content) but the typical use case I've seen on the small scale is "I want to self-host my git repos because M$ has ruined GitHub, but some VC-funded assholes are drowning the server in requests".
They could just clone the git repo, and then pull every n hours, but it requires specialized code so they won't. Why would they? There's no money in maintaining that. And that's true for any positive measure you may imagine until these companies are fined for destroying the commons.
While that’s a reasonable opinion to have, it’s a fight they can’t really win. It’s like putting up a poster in a public square then running up to random people and shouting “no, this poster isn’t for you because I don’t like you, no looking!” Except the person they’re blocking is an unstoppable mega corporation that’s not even morally in the wrong imo (except for when they overburden people’s sites, that’s bad ofc)
literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" https://lock.cmpxchg8b.com/anubis.html Maybe Tavis needs more google-fu. Maybe that includes using DuckDuckGo?
Seems like AI bots are indeed bypassing the challenge by computing it: https://social.anoxinon.de/@Codeberg/115033790447125787