Why are anime catgirls blocking my access to the Linux kernel?
(lock.cmpxchg8b.com)
816 points by taviso 4 days ago
That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all, which is more of a plus.
This, however, forces servers to increase the challenge difficulty, which increases the waiting time for first-time access.
Obviously the developer of Anubis thinks it is bypassing: https://github.com/TecharoHQ/anubis/issues/978
The point is that it will always be cheaper for bot farms to pass the challenge than for regular users.
Why does that matter? The challenge needs to stay expensive enough to slow down bots, but legitimate users won't be solving anywhere near the same number of challenges, and the alternative is the site getting crawled to death, so they can wait once in a while.
Too bad the challenge's result is only a waste of electricity. Maybe they should do like some of those alt-coins and search for prime numbers or something similar instead.
this only holds true if the data to be accessed is less valuable than the computational cost. in this case, that is false, and spending a few dollars to scrape the data is more than worth it.
reducing the problem to a cost issue is bound to be short sighted.
This is not about preventing crawling entirely, it's about finding a way to prevent crawlers from re-crawling everything way too frequently just because crawling is very cheap. Of course it will always be worth it to crawl the Linux Kernel mailing list, but maybe with a high enough cost per crawl the crawlers will learn to be fine with only crawling it once per hour, for example.
my comment is not about preventing crawling, it's stating that with how much revenue AI is bringing in (real or not), the value of crawling repeatedly >>> the cost of running these flimsy coin-mining algorithms.
At the very least captcha tries to make the human-AI distinction, but these algorithms are purely on the side of making it "expensive". if it's just a capital problem, then it's not a problem for the big corps who are the ones incentivized to do it in the first place!
even if human captcha solvers are involved, at the very least it provides society with some jobs (useless as they may be), but these mining algorithms do society no good and waste compute for nothing!
My biggest bitch is that it requires JS and cookies...
Although the long term problem is the business model of servers paying for all network bandwidth.
Actual human users have consumed a minority of total net bandwidth for decades:
https://www.atom.com/blog/internet-statistics/
Part 4 shows bots out-using humans as far back as 1996 8-/
What are "bots"? This needs to include goggleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.
The difference between that and the LLM training data scraping, is that the previous non-human traffic was assumed, by site servers, to increase their human traffic, through search engine ranking, and thus their revenue. However the current training data scraping is likely to have the opposite effect: capturing traffic with LLM summaries, instead of redirecting it to original source sites.
This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.
So far, it's in the same category as the environmental disaster in progress: ownership is refusing to acknowledge the problem and insisting on business as usual.
Rational predictions are that it's not going to end well...
"Although the long term problem is the business model of servers paying for all network bandwidth."
Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.
The AI scrapers are most likely VC-funded, and all they care about is getting as much data as possible, not worrying about the costs.
They are renting machines at scale too, so bandwidth etc. is definitely cheaper for them. Maybe use a provider that doesn't have bandwidth issues (Hetzner?).
But still, the point is that you might be hosting a website on your small server, and a scraper with its beast of a machine fleet can come and effectively DDoS your server looking for data to scrape. Deterring them is what matters, so that the economics finally slide back in our favour again.
Maybe my statement wasn't clear. The point is that the server operators pay for all of the bandwidth of access to their servers.
When this access is beneficial to them, that's OK, when it's detrimental to them, they're paying for their own decline.
The statement isn't really concerned with what if anything the scraper operators are paying, and I don't think that really matters in reaching the conclusion.
> The difference between that and the LLM training data scraping
Is the traffic that people are complaining about really training traffic?
My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
That doesn't seem like enough traffic to be a really big problem.
On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.
Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.
So what's really going on here? Anybody actually know?
The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.
There's some user-directed traffic, but it's a small fraction, in my experience.
It's not random internet people saying it's training. It's Cloudflare, among others.
Search for “A graph of daily requests over time, comparing different categories of AI Crawlers” on this blog: https://blog.cloudflare.com/ai-labyrinth/
In the feed today:
AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders
The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits it's a steady load of lots of traffic going all over, from lots of IPs. And they really like diffs between old page revisions for some reason.
That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.
But if there's a (discoverable) page comparing every revision of a page to every other revision, and a page has N revisions, there are going to be (N^2-N)/2 delta pages (a page with 100 revisions yields 4,950 of them), so could it just be that the majority of the distinct pages your wiki has are deltas?
I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?
The questions just multiply.
It's near-stock mediawiki, so it has a ton of old versions and diffs off the history tab but I'd expect a crawler to be able to handle it.
What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?
If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.
Search engines, at least, are designed to index the content, for the purpose of helping humans find it.
Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.
This is exactly the reason I "just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."
> copyright attribution
You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example if you subtract the vector for "woman" from "man" and add the difference to "king" you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al. and to say they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.
LLMs quite literally work at the level of their source material, that's how training works, that's how RAG works, etc.
There is no proof that LLMs work at the level of "ideas"; if you could prove that, you'd solve a whole lot of incredibly expensive problems that are current bottlenecks for training and inference.
It is a bit ironic that you'd call someone wanting to control and be paid for the thing they themselves created "selfish", while at the same time writing apologia on why it's okay for a trillion dollar private company to steal someone else's work for their own profit.
It isn't some moral imperative that OpenAI gets access to all of humanity's creations so they can turn a profit.
As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun increasingly faster/louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis the traffic went down to before the AI hype started.
Same, ClaudeBot makes a stupid amount of requests on my git storage. I just blocked them all on Cloudflare.
As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.
Yep, AI scrapers have been breaking our open-source project gerrit instance hosted at Linux Network Foundation.
Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.
>Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me.
a mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.
Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163
My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.
It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
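A minimal sketch of that idea (not Anubis's actual implementation; the claim names, TTL, and secret handling here are made up for illustration): issue a signed token bound to the solver's IP once the proof of work is done, and reject it if presented from another address, so rotating IPs forces a fresh solve.

```python
# Sketch only: a PoW token bound to the solver's IP, checked alongside the
# normal per-IP rate limit. Uses PyJWT (pip install pyjwt); all names,
# lifetimes, and claims are illustrative, not Anubis internals.
import time
import jwt  # PyJWT

SECRET = "server-side-secret"

def issue_token(client_ip: str, ttl_seconds: int = 7 * 24 * 3600) -> str:
    """Issued only after the proof-of-work challenge has been solved."""
    claims = {"ip": client_ip, "exp": int(time.time()) + ttl_seconds}
    return jwt.encode(claims, SECRET, algorithm="HS256")

def token_valid(token: str, client_ip: str) -> bool:
    """Reject expired tokens and tokens presented from a different IP."""
    try:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    except jwt.InvalidTokenError:
        return False
    return claims.get("ip") == client_ip  # rotating IPs forces a new PoW
```

A scraper that rotates IPs to dodge the rate limiter then has to re-solve the challenge for every new address, which is the whole point.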
Why haven't they been sued and jailed for DDoS, which is a felony?
Criminal convictions in the US require a standard of proof that is "beyond a reasonable doubt" and I suspect cases like this would not pass the required mens rea test, as, in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service... and trying to argue otherwise based on any technical reasoning (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court... especially considering web scraping has already been ruled legal, and that a ToS clause against that cannot be legally enforced.
High volume and inorganic traffic patterns. Wikimedia wrote about it here: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...
they seem to be written either by idiots or by people that don't give a shit about being good internet citizens
either way the result is the same: they induce massive load
well written crawlers will:
- not hit a specific ip/host more frequently than say 1 req/5s
- put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
- limit crawling depth based on crawled page quality and/or response time
- respect robots.txt
- make it easy to block them
- wait 2 seconds for a page to load before aborting the connection
- wait for the previous request to finish before requesting the next page, since piling on requests would only induce more load, get even slower, and eventually take everything down (a minimal sketch of such a crawler follows)
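For illustration, here is a minimal sketch of the per-host politeness part of that list (robots.txt plus a fixed delay per host); a real crawler would add a distributed queue, depth limits, caching of robots.txt, and better error handling. The user agent and delay values are assumptions, not anyone's actual crawler.

```python
# Sketch of a polite single-host fetch policy: respect robots.txt and never
# hit the same host more often than once every 5 seconds.
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests  # assumed available

LAST_HIT: dict[str, float] = {}   # host -> timestamp of last request
MIN_DELAY = 5.0                   # seconds between requests to one host
UA = "ExampleBot/1.0 (+https://example.com/bot)"  # easy to identify and block

def allowed(url: str) -> bool:
    host = urlparse(url).netloc
    rp = urllib.robotparser.RobotFileParser(f"https://{host}/robots.txt")
    rp.read()                       # a real crawler would cache this per host
    return rp.can_fetch(UA, url)

def polite_get(url: str):
    if not allowed(url):
        return None
    host = urlparse(url).netloc
    wait = MIN_DELAY - (time.monotonic() - LAST_HIT.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)            # one request in flight, spaced out
    LAST_HIT[host] = time.monotonic()
    return requests.get(url, headers={"User-Agent": UA}, timeout=30)
```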
I've designed my site to hold up to traffic spikes anyway and the bots I'm getting aren't as crazy as the ones I hear about from other, bigger website operators (like the OpenStreetMap wiki, still pretty niche), so I don't block much of them. Can't vet every visitor so they'll get the content anyway, whether I like it or not. But if I see a bot having HTTP 499 "client went away before page finished loading" entries in the access log, I'm not wasting my compute on those assholes. That's a block. I haven't had to do that before, in a decade of hosting my own various tools and websites
As I've been saying for a while now - if you want to filter for only humans, ask questions only a human can easily answer; counting the number of letters in a word seems to be a good way to filter out LLMs, for example. Yes, that can be relatively easily gotten around, just like Anubis, but with the benefit that it doesn't filter out humans and has absolutely minimal system requirements (a browser that can submit HTML forms), possibly even less than the site itself.
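As a hedged sketch of that idea (the words and wording are made up; in practice you'd want more variety), the server side is just a question generator and a one-line check, with the front end being a plain HTML form that any browser can submit:

```python
# Sketch: a "count the letters" gate that only needs HTML form support,
# no JavaScript. Question/answer pairs are generated per visit.
import random

WORDS = ["penguin", "keyboard", "mailing", "kernel"]

def make_challenge() -> tuple[str, int]:
    word = random.choice(WORDS)
    letter = random.choice(word)
    question = f'How many times does the letter "{letter}" appear in "{word}"?'
    return question, word.count(letter)

def check_answer(submitted: str, expected: int) -> bool:
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```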
There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)
For smaller forums, any customization to the new account process will work. When I ran a forum that was getting a frustratingly high amount of spammer signups, I modified the login flow to ask the user to add 1 to the 6-digit number in the stock CAPTCHA. Spam signups dropped like a rock.
> counting the number of letters in a word seems to be a good way to filter out LLMs
As long as this challenge remains obscure enough to not be worth implementing special handlers for in the crawler, this sounds like a neat idea.
But I think if everyone starts doing this particular challenge (char count), the crawlers will start instructing a cheap LLM to do appropriate tool calls and get around it. So the challenge needs to be obscure.
I wonder if anyone has tried building a crawler firewall, or even an nginx script, which would let the site admin plug in their own challenge generator in Lua or something, which would then serve a minimal HTML form. Maybe even vibe code it :)
... looks like they did: https://github.com/TecharoHQ/anubis/pull/1004, timestamped a few hours after your comment.
lmfao so that kinda defeats the entire point of this project if they have to resort to a manual IP blocklist anyways
I would actually say that it's been successful in identifying at least one large-scale abuser so far, which can then be blocked via more traditional methods.
I have my own project that finds malicious traffic IP addresses, and through searching through the results, it's allowed me to identify IP address ranges to be blocked completely.
Yielding useful information may not have been what it was designed to do, but it's still a useful outcome. Funny thing about Anubis' viral popularity is that it was designed to just protect the author's personal site from a vast army of resource-sucking marauders, and grew because it was open sourced and a LOT of other people found it useful and effective.
I think that was already common knowledge as hansjorg above suggests
From tjhorner on this same thread
"Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project."
So it's meant/preferred to block low-effort crawlers, which can still cause damage if you don't deal with them. A 3-second deterrent seems good in that regard. Maybe the 3-second deterrent can come as rate limiting an IP? But they might use swaths of IPs :/
Anubis exists specifically to handle the problem of bots dodging IP rate limiting. The challenge is tied to your IP, so if you're cycling IPs with every request, you pay dramatically more PoW than someone using a single IP. It's intended to be used in depth with IP rate limiting.
Yea I'm not convinced unless somehow the vast majority of scrapers aren't already using headless browsers (which I assume they are). I feel like all this does is warm the planet.
Article might be a bit shallow, or maybe my understanding of how Anubis works is incorrect?
1. Anubis makes you calculate a challenge.
2. You get a "token" that you can use for a week to access the website.
3. (I don't see this being considered in the article) "token" that is used too much is rate limited. Calculating a new token for each request is expensive.
That, but apparently also restrictions on what tech you can use to access the website:
- https://news.ycombinator.com/item?id=44971990 person being blocked with `message looking something like "you failed"`
- https://news.ycombinator.com/item?id=44970290 mentions of other requirements that are allegedly on purpose to block older clients (as browser emulators presumably often would appear to be, because why would they bother implementing newer mechanisms when the web has backwards compatibility)
That's the basic principle. It's a tool to fight crawlers that spam requests without cookies to evade rate limiting.
The Chinese crawlers seem to have adjusted their crawling techniques to give their browsers enough compute to pass standard Anubis checks.
Whenever I see an otherwise civil and mature project utilize something outwardly childish like this I audibly groan and close the page.
I'm sure the software behind it is fine but the imagery and style of it (and the confidence to feature it) makes me doubt the mental credibility/social maturity of anybody willing to make it the first thing you see when accessing a webpage.
Edit: From a quick check of the "CEO" of the company, I was unsurprised to have my concerns confirmed. I may be behind the times, but I think there are far too many people who act obnoxiously (as part of what can only be described as a new subculture) in open source software today, and I wish there were better terms to describe it.
So... Is Anubis actually blocking bots because they didn't bother to circumvent it?
Basically. Anubis is meant to block mindless, careless, rude bots with seemingly no technically proficient human behind the process; these bots tend to be very aggressive and make tons of requests bringing sites down.
The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.
Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.
> I host this blog on a single core 128MB VPS
Where does one even find a VPS with such small memory today?
Or software to run on it. I'm intrigued about this claim as well.
The software is easy. Apt install debian apache2 php certbot and you're pretty much set to deploy content to /var/www. I'm sure any BSD variant is also fine, or lots of other software distributions that don't require a graphical environment
On an old laptop running Windows XP (yes, with GUI, breaking my own rule there) I've also run a lot of services, iirc on 256MB RAM. XP needed about 70 I think, or 52 if I killed stuff like Explorer and unnecessary services, and the remainder was sufficient to run a uTorrent server, XAMPP (Apache, MySQL, Perl and PHP) stack, Filezilla FTP server, OpenArena game server, LogMeIn for management, some network traffic monitoring tool, and probably more things I'm forgetting. This ran probably until like 2014 and I'm pretty sure the site has been on the HN homepage with a blog post about IPv6. The only thing that I wanted to run but couldn't was a Minecraft server that a friend had requested. You can do a heck of a lot with a hundred megabytes of free RAM but not run most Javaware :)
What I meant is that I’m not sure it will even boot. Bookworm minimum requirements are 256MB of RAM.
https://www.debian.org/releases/bookworm/armel/ch03s04.en.ht...
128MB should be plenty. I used systems for years with much less. But in reality, Linux is much heavier these days.
> PoW is minor for botters
But still enough to prevent a billion request DDoS
These sites have been search-engine scraped forever. It's not about blocking bots entirely, just about this new wave of fuck-you-I-don't-care-if-your-host-goes-down quasi-malicious scrapers.
Yes, but a single bot is not a concern. It's the first "D" in DDoS that makes it hard to handle
(and these bots tend to be very, very dumb - which often happens to make them more effective at DDoSing the server, as they're taking the worst and the most expensive ways to scrape content that's openly available more efficiently elsewhere)
I don't care that they use anime catgirls.
What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.
I hate Amazon's failure pets, I hate Google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.
It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.
This is something I've always felt about design in general. You should never make it so that a symbol for an inconvenience appears happy or smug, it's a great way to turn people off your product or webpage.
Reddit implemented something a while back that says "You've been blocked by network security!" with a big smiling Reddit snoo front and centre on the page and every time I bump into it I can't help but think this.
> What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net
This is probably intentional. They offer a paid unbranded version. If they had a corporate-friendly brand on the free offering, then there would be fewer people paying for the unbranded one.
The original versions were a way to make even a boring event such as a 404 fun. If the page stops conveying the type of error to the user then it's just bad UX, but vomiting all the internal jargon at a non-tech user is also bad UX.
So, I don't see an error code + something fun to be that bad.
People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today, so I don't see how having fun error pages to be such an issue?
This assumes it's fun.
Usually when I hit an error page, and especially if I hit repeated errors, I'm not in the mood for fun, and I'm definitely not in the mood for "fun" provided by the people who probably screwed up to begin with. It comes off as "oops, we can't do anything useful, but maybe if we try to act cute you'll forget that".
Also, it was more fun the first time or two. There's a not a lot of orginal fun on the error pages you get nowadays.
> People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today
It's been a while, but I don't remember much gratuitous cutesiness on the 90s Web. Not unless you were actively looking for it.
> This assumes it's fun.
Not to those who don't exist in such cultures. It's creepy, childish, strange to them. It's not something they see in everyday life, nor would I really want to. There is a reason why cartoons are aimed for younger audiences.
Besides if your webserver is throwing errors, you've configured it incorrectly. Those pages should be branded as the site design with a neat and polite description to what the error is.
I can't find any documentation that says Anubis does this, (although it seems odd to me that it wouldn't, and I'd love a reference) but it could do the following:
1. Store the nonce (or some other identifier) of each jwt it passes out in the data store
2. Track the number or rate of requests from each token in the data store
3. If a token exceeds the rate limit threshold, revoke the token (or do some other action, like tarpit requests with that token, or throttle the requests)
Then if a bot solves the challenge it can only continue making requests with the token if it is well behaved and doesn't make requests too quickly.
It could also do things like limit how many tokens can be given out to a single ip address at a time to prevent a single server from generating a bunch of tokens.
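A hedged sketch of steps 1-3 above (this is a hypothetical design, not documented Anubis behaviour; the key names, limits, and revocation policy are all made up): keep a counter per token nonce in the data store and revoke the token when it gets too chatty.

```python
# Sketch: per-token rate accounting with revocation, backed by Redis here
# purely for illustration.
import time
import redis  # assumed available

r = redis.Redis()
LIMIT = 60          # max requests per token per window (illustrative)
WINDOW = 60         # window length in seconds

def request_allowed(token_nonce: str) -> bool:
    if r.sismember("revoked", token_nonce):
        return False
    key = f"rate:{token_nonce}:{int(time.time() // WINDOW)}"
    count = r.incr(key)
    r.expire(key, WINDOW * 2)          # let old windows age out
    if count > LIMIT:
        r.sadd("revoked", token_nonce)  # force the client to re-solve the PoW
        return False
    return True
```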
On my daily browser with V8 JIT disabled, Cloudflare Turnstile has the worst performance hit, and often requires an additional click to clear.
Anubis usually clears with no clicks and no noticeable slowdown, even with JIT off. Among the common CAPTCHA solutions it's the least annoying for me.
I always wondered about these anti bot precautions... as a firefox user, with ad blocking and 3rd party cookies disabled, i get the goddamn captcha or other random check (like this) on a bunch of pages now, every time i visit them...
Is it worth it? Millions of users wasting cpu and power for what? Saving a few cents on hosting? Just rate limit requests per second per IP and be done.
Sooner or later bots will be better at captchas than humans, what then? What's so bad with bots reading your blog? When bots evolve, what then? UK style, scan your ID card before you can visit?
The internet became a pain to use... back in the day, you opened the website and saw the content. Now you open it, get an anti-bot check, click, forward to the actual site, a cookie prompt, multiple clicks, then a headline + ads, scroll down a millimeter... do you want to subscribe to a newsletter? Why, I didn't even read the first sentence of the article yet... scroll down... chat with AI bot popup... a bit further down, log in here to see the full article...
Most of the modern web is unusable. I know I'm ranting, but this is just one of the pieces of a puzzle that makes basic browsing a pain these days.
Anubis works because AI crawlers make very few requests from each IP address to bypass rate limiting. Last year they could still be blocked by IP range, but now the requests come from so many different networks that that doesn't work anymore.
Doing the proof-of-work for every request is apparently too much work for them.
Crawlers using a single ip, or multiple ips from a single range are easily identifiable and rate-limited.
Good on you for finding a solution, but personally I will just not use websites that pull this, and I won't contribute to projects where using such a website is required. If you respect me so little that you will make demands about how I use my computer and block me as a bot if I don't comply, then I am going to assume that you're not worth my time.
With the asymmetry of doing the PoW in javascript versus compiled c code, I wonder if this type of rate limiting is ever going to be directly implemented into regular web browsers. (I assume there's already plugins for curl/wget)
Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.
Apple supports people that want to not use their software as the gods at Apple intended it? What parallel universe Version of Apple is this!
Seriously though, does anything of Apple's work without JS, like Icloud or Find my phone? Or does Safari somehow support it in a way that other browsers don't?
Last I checked, safari still had a toggle to disable javascript long after both chrome and firefox removed theirs. That's what I was referring to.
> The idea of “weighing souls” reminded me of another anti-spam solution from the 90s… believe it or not, there was once a company that used poetry to block spam!
> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.
Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?
IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.
The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?
The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce the amount of overall volume by limiting how many independent requests can be made at any given time.
That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.
Oh, it's time to bring the Internet back to humans. Maybe it's time to treat the first layer of the Internet as just transport. Then, layer large VPN networks on top and put services there. People will just VPN to a vISP to reach content. Different networks, different interests :) But this time don't fuck up abuse handling. Someone is doing something fishy? Depeer him from the network (or his uncooperative upstream!).
I think the solution to captcha-rot is micro-payments. It does consume resources to serve a web page, so who's going to pay for that?
If you want to do advertisement then don't require a payment, and be happy that crawlers will spread your ad to the users of AI-bots.
If you are a non-profit-site then it's great to get a micro-payment to help you maintain and run the site.
Something feels bizarrely incongruent about the people using Anubis. These people used to be the most vehemently pro-piracy, pro internet freedom and information accessibility, etc.
Yet now when it's AI accessing their own content, suddenly they become the DMCA and want to put up walls everywhere.
I'm not part of the AI doomer cult like many here, but it would seem to me that if you publish your content publicly, typically the point is that it would be publicly available and accessible to the world...or am I crazy?
As everything moves to AI-first, this just means nobody will ever find your content and it will not be part of the collective human knowledge. At which point, what's the point of publishing it.
In case you're genuinely confused, the reason for Anubis and similar tools is that AI-training-data-scraping crawlers are assholes, and strangle the living shit out of any webserver they touch, like a cloud of starving locusts descending upon a wheat field.
i.e. it's DDoS protection.
> an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources
Sure, if you ignore that humans click on one page and the problematic scrapers (not the normal search engine volume, but the level we see nowadays where misconfigured crawlers go insane on your site) are requesting many thousands to millions of times more pages per minute. So they'll need many many times the compute to continue hammering your site whereas a normal user can muster to load that one page from the search results that they were interested in
> I host this blog on a single core 128MB VPS
No wonder the site is being hugged to death. 128MB is not a lot. Maybe it's worth upgrading if you post to Hacker News. Just a thought.
Still, 128MB is not enough to even run Debian let alone Apache/NGINX. I’m on my phone, but it doesn’t seem like the author is using Cloudflare or another CDN. I’d like to know what they are doing.
Moving bytes around doesn't take RAM but CPU. Notice how switches don't advertise how many gigabytes of RAM they have, yet can push a few gigabits of content around between all 24 ports at once without even being expensive.
Also, the HN homepage is pretty tame so long as you don't run WordPress. You don't get more than a few requests per second, so multiply that with the page size (images etc.) and you probably get a few megabits as bandwidth, no problem even for a Raspberry Pi 1 if the sdcard can read fast enough or the files are mapped to RAM by the kernel
I wrote about something similar a while back: https://maori.geek.nz/proof-of-human-2ee5b9a3fa28
It's about the difficulty of proving you are human, especially when every test built has so much incentive to be broken. I don't think it will be solved, or could ever be solved.
Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
Your link explicitly says:
> It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
It's meant to rate-limit accesses by requiring client-side compute light enough for legitimate human users and responsible crawlers in order to access but taxing enough to cost indiscriminate crawlers that request host resources excessively.
It indeed mentions that lighter crawlers do not implement the right functionality in order to execute the JS, but that's not the main reason why it is thought to be sensible. It's a challenge saying that you need to want the content bad enough to spend the amount of compute an individual typically has on hand in order to get me to do the work to serve you.
Here's a more relevant quote from the link:
> Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.
As the article notes, the work required is negligible, and as the linked post notes, that's by design. Wasting scraper compute is part of the picture to be sure, but not really its primary utility.
Why require proof of work with difficulty at all then? Just have no UI other than (javascript) required and run a trivial computation in WASM as a way of testing for modern browser features. That way users don't complain that it is taking 30s on their low-end phone and it doesn't make it any easier for scrapers to scrape (because the PoW was trivial anyways).
The compute also only seems to happen once, not for every page load, so I'm not sure how this is a huge barrier.
Anubis is based on hashcash concepts - just adapted to a web request flow. Basically the same thing - moderately expensive for the sender/requester to compute, insanely cheap for the server/recipient to verify.
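To make that asymmetry concrete, here is a toy hashcash-style round (not Anubis's exact scheme; the difficulty value, encoding, and function names are invented for illustration): the client loops until it finds a nonce whose hash has enough leading zero bits, and the server verifies the result with a single hash.

```python
# Toy hashcash-style proof of work: expensive to solve, one hash to verify.
import hashlib
from itertools import count

DIFFICULTY = 20  # leading zero bits required; illustrative value only

def leading_zero_bits(digest: bytes) -> int:
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()

def solve(challenge: str) -> int:
    for nonce in count():                          # client-side brute force
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce

def verify(challenge: str, nonce: int) -> bool:    # server side: one hash
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY
```

Each extra bit of difficulty roughly doubles the expected client work while verification stays at a single hash.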
We need bitcoin-based lightning nano-payments for such things. Like visiting the website will cost $0.0001 cent, the lightning invoice is embedded in the header and paid for after single-click confirmation or if threshold is under a pre-configured value. Only way to deal with AI crawlers and future AI scams.
With the current approach we just waste the energy, if you use bitcoin already mined (=energy previously wasted) it becomes sustainable.
We deployed hashcash for a while back in 2004 to implement Picasa's email relay - at the time it was a pretty good solution because all our clients were kind of similar in capability. Now I think the fastest/slowest device is a broader range (just like Tavis says), so it is harder to tune the difficulty for that.
Just use Anubis Bypass: https://addons.mozilla.org/en-US/android/addon/anubis-bypass...
Haven't seen dumb anime characters since.
Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?
No, the economics will never work out for a Proof of Work-based counter-abuse challenge. CPU is just too cheap in comparison to the cost of human latency. An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users. So there's no point in theorizing about an attacker solving the challenges cheaper than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
But for a scraper to be effective it has to load orders of magnitude more pages than a human browses, so a fixed delay causes a human to take 1.1x as long, but it will slow down scraper by 100x. Requiring 100x more hardware to do the same job is absolutely a significant economic impediment.
>An hour of a server CPU costs $0.01. How much is an hour of your time worth?
That's irrelevant. A human is not going to be solving the challenge by hand, nor is the computer of a legitimate user going to be solving the challenge continuously for one hour. The real question is, does the challenge slow down clients enough that the server does not expend outsized resources serving requests of only a few users?
>Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users.
No, I disagree. If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
The human needs to wait for their computer to solve the challenge.
You are trading something dirt-cheap (CPU time) for something incredibly expensive (human latency).
Case in point:
> If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
No. A human sees a 10x slowdown. A human on a low end phone sees a 50x slowdown.
And the scraper paid one 1/1000000th of a dollar. (The scraper does not care about latency.)
That is not an effective deterrent. And there is no difficulty factor for the challenge that will work. Either you are adding too much latency to real users, or passing the challenge is too cheap to deter scrapers.
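A back-of-the-envelope version of that trade-off, using only the figures quoted in this thread (these are the thread's assumptions, not measurements):

```python
# Rough numbers from the thread: what a 250 ms PoW costs each side.
CPU_HOUR_USD = 0.01          # quoted price of an hour of server CPU
CHALLENGE_SECONDS = 0.25     # challenge time on fast hardware
SLOW_PHONE_FACTOR = 10       # assumed slowdown on a low-end phone

scraper_cost_per_token = CPU_HOUR_USD * CHALLENGE_SECONDS / 3600
human_wait_seconds = CHALLENGE_SECONDS * SLOW_PHONE_FACTOR

print(f"scraper pays ~${scraper_cost_per_token:.9f} per token")   # ~$0.0000007
print(f"a human on a slow phone waits ~{human_wait_seconds:.1f} s per token")
```

On those numbers the scraper's cost per solved challenge is well under a thousandth of a cent, while the human pays the wait in real time, which is the asymmetry being argued here.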
Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
wait but then why bother with this PoW system at all? if they're just trying to block anyone without JS that's way easier and doesn't require slowing things down for end users on old devices.
reminds me of how Wikipedia literally has all its data available, even in a nice format, just for scrapers (I think), and even THEN some scrapers still scraped Wikipedia and cost it so much money that I'm pretty sure some official statement had to be made, or at least it was disclosed publicly.
Even then, man, I feel like you could save so many resources (both yours and Wikipedia's) if scrapers had the sense not to scrape Wikipedia and instead follow Wikipedia's rules.
If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.
And Codeberg, even behind Anubis, is not immune from scrapers either
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.
Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straight-forward cat and mouse game. And, that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. It increases more the more sites that use this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than it does a specific random visitor.
To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.
If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.
In the long term, I think the success of this class of tools will stem from two things:
1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.
2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.
I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.
> A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
... has phpbb not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
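For reference, the trick being alluded to, sketched in Python rather than phpBB's actual PHP (every name here is made up): only create the persistent session once the client has proven it can echo a cookie back, so cookie-less bots never cause a database write.

```python
# Sketch: defer session creation until the client echoes a cookie back.
# SESSIONS stands in for the session table; set_cookie is whatever your
# framework provides. Illustrative only, not phpBB's implementation.
import secrets

SESSIONS: dict[str, dict] = {}

def handle_request(cookies: dict, set_cookie):
    sid = cookies.get("sid")
    if sid is not None:
        # Second visit or later: the client clearly handles cookies, so
        # creating/loading a persistent session is now worthwhile.
        return SESSIONS.setdefault(sid, {"id": sid})
    # First visit: hand out a candidate ID but store nothing server-side.
    set_cookie("sid", secrets.token_hex(16))
    return None   # treat this request as anonymous and stateless
```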
We have been seeing our clients' sites being absolutely *hammered* by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Personally I have no issues with AI bots, that properly identify themselves, from scraping content as if the site operator doesn't want it to happen they can easily block the offending bot(s).
We built our own proof-of-work challenge that we enable on client sites/accounts as they come under 'attack' and it has been incredible how effective it is. That said I do think it is only a matter of time before the tactics change and these "malicious" AI bots are adapted to look more human / like real browsers.
I mean honestly it wouldn't be _that_ hard to enable them to run javascript or to emulate a real/accurate User-Agent. That said they could even run headless versions of the browser engines...
It's definitely going to be cat-and-mouse.
The most brutal honest truth is that if they throttled themselves as not to totally crash whatever site they're trying to scrape we'd probably have never noticed or gone through the trouble of writing our own proof-of-work challenge.
Unfortunately those writing/maintaining these AI bots that hammer sites to death probably either have no concept of the damage it can do or they don't care.
> We have been seeing our clients' sites being absolutely hammered by AI bots trying to blend in. Some of the bots use invalid user agents - they _look_ valid on the surface, but under the slightest scrutiny, it becomes obvious they're not real browsers.
Yep. I noticed this too.
> That said they could even run headless versions of the browser engines...
Yes, exactly. To my knowledge that's what's going on with the latest wave that is passing Anubis.
That said, it looks like the solution to that particular wave is going to be to just block Huawei cloud IP ranges for now. I guess a lot of these requests are coming from that direction.
Personally though I think there are still a lot of directions Anubis can go in that might tilt this cat and mouse game a bit more. I have some optimism.
> The CAPTCHA forces visitors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
Isn't Anubis a dog? So it should be an anime dog/wolf girl rather than a cat girl?
Yes, Anubis is a dog-headed or jackal-headed god. I actually can't find anywhere on the Anubis website where they talk about their mascot; they just refer to her neutrally as the "default branding".
Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.
I read hackernews on my phone when I'm bored and I've seen it a lot lately. I don't think I've ever seen it on my desktop.
Would it not be more effective just to require payment for accessing your website? Then you don't need to care about bot or not.
For the same reason why cats sit on your keyboard. Because they can
Surely the difficulty factor scales with the system load?
Can we talk about the "sexy anime girl" thing? Seems it's popular in geek/nerd/hacker circles and I for one don't get it. Browsing reddit anonymously you're flooded with near-pornographic fan-made renders of these things, I really don't get the appeal. Can someone enlighten me?
It's a good question. Anime (like many media, but especially anime) is known to have gratuitous fan service where girls/women of all ages are in revealing clothing for seemingly no reason except to just entice viewers.
The reasoning is that because they aren't real people, it's okay to draw and view images of anime characters, regardless of their age. And because geek/nerd circles tend not to socialize with real women, we get this over-proliferation of anime girls.
2D girls don't nag and I've never had to clear their clogged hair out of my shower drain.
We're 1-2 years away from putting the entire internet behind Cloudflare, and Anubis is what upsets you? I really don't get these people. Seeing an anime catgirl for 1-2 seconds won't kill you. It might save the internet though.
The principle behind Anubis is very simple: it forces every visitor to brute force a math problem. This cost is negligible if you're running it on your computer or phone. However, if you are running thousands of crawlers in parallel, the cost adds up. Anubis basically makes it expensive to crawl the internet.
It's not perfect, but much much better than putting everything behind Cloudflare.
So it's a paywall with -- good intentions -- and even more accessibility concerns. Thus accelerating enshittification.
Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?
It's convoluted security theater mucking up an already bloated, flimsy, and sluggish internet. It's frustrating enough to guess schoolbuses every time I want to get work done; now I have to see pornified kitty waifus.
(openwrt is another community plagued with this crap)
here is the community post with Anubis pro / con experiences https://forum.openwrt.org/t/trying-out-anubis-on-the-wiki/23...
i suppose one nice property is that it is trivially scalable. if the problem gets really bad and the scrapers have llms embedded in them to solve captchas, the difficulty could be cranked up and the lifetime could be cranked down. it would make the user experience pretty crappy (party like it's 1999) but it could keep sites up for unauthenticated users without engaging in some captcha complexity race.
it does have arty political vibes though, the distributed and decentralized open source internet with guardian catgirls vs. late stage capitalism's quixotic quest to eat itself to death trying to build an intellectual and economic robot black hole.
You needed to have a security contact on your website, or at least in the repo. You did not. You assumed security researchers would instead back out to your Github account's repository list, find the .github repository, and look for a security policy there. That's not a thing!
I'm really surprised you wrote this.
The security policy that didn't exist until a few hours ago?
Added on March 18: https://github.com/TecharoHQ/.github/commits/main/SECURITY.m...
Copied to the root of the repo after the disclosure
ref: https://github.com/TecharoHQ/anubis/issues/1002#issuecomment...
Isn't using an anime catgirl avatar the exact opposite of "look at meee"?
no. it's someone wanting attention and feeling ok creating an interstitial page to capture your attention which does not prove you're a human while saying that the page proves you're human.
the entire thing is ridiculous. and only those who see no problem shoving anime catgirls into the face of others will deploy it. maybe that's a lot of people; maybe only I object to this. The reality is that there's no technical reason to deploy it, as called out in the linked blog article, so the only reason to do this is a "look at meee" reason, or to announce that one is a fan of this kind of thing, which is another "look at meee"-style reason.
Why do I object to things like this? Because once you start doing things like this, doing things for attention, you must continually escalate in order to keep capturing that attention. Ad companies do this, and they don't see a problem in escalation, and they know they have to do it. People quickly learn to ignore ads, so in order to make your page loads count as an advertiser, you must employ means which draw attention to your ads. It's the same with people who no longer draw attention to themselves because they like anime catgirls. Now they must put an interstitial page up to force you to see that they like anime catgirls. We've already established that the interstitial page accomplishes nothing other than showing you the image, so showing the image must be the intent. That is what I object to.
you are overthinking
it's as simple as having a nice picture there; it makes this whole thing feel nicer and gives it a bit of personality
so you put in some picture/art you like
that's it
similarly, any site using it can change that picture, but there isn't any fundamental problem with the picture, so most don't care to change it
>Please call me (order of preference): They/them or She/her please.
Take a wild guess
Well that tells me the rough cultural area alt187 might be pointing to, but I could use some more clarity. Are you saying that pronouns/transgender/queerness are the ideology? Or are you showing them as shibboleths of a broader ideological tendency that's prevailing in FOSS?
For lack of a better term one might describe the ideology as "woke"
Honestly I am okay with anime catgirls since I just find it funny, but still, it would be cool to see Linux-related stuff. Imagine a gif of Mr. Tux the penguin racing in, like, SuperTuxKart for the Linux website.
SourceHut also uses Anubis, but they have replaced the anime catgirl thing with their own logo. I think Disroot does that too, but I'm not sure.
Why does Anubis not leverage PoW from its users to do something useful (at best, distributed computing for science; at worst, a cryptocurrency at least allowing the webmasters to get back some cash)?
People are already complaining. Could you imagine how much fodder this would give people who didn't like the work, or the distribution of any funds that a cryptocurrency would create (which would be pennies, I think, and more work to distribute than would be worth doing)?
I don't understand, why do people resort to this tool instead of simply blocking by UA string or IP address. Are there so many people running these AI crawlers?
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
Lots of companies run these kind of crawlers now as part of their products.
They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.
There are lots of companies around that you can buy this type of proxy service from.
You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.
Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. they are really idiotic in behaviour, but at least report themselves correctly.
OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block - why would you implement an Anubis PoW MITM proxy when you could just block on UA?
I get the sense many of the bad actors are simply poor copycats that are poorly building LLMs and are scraping the entire web without a care in the world
Source: Cloudflare
https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
Perplexity's defense is that they're not doing it for training/KB building crawls but for answering dynamic queries calls and this is apparently better.
Kernel.org* just has to actually configure Anubis rather than deploying the default broken config. Enable the meta-refresh proof of work rather than relying on the bleeding-edge, corporate-browsers-only JavaScript application proof of work.
* or whatever site the author is talking about, his site is currently inaccessible due to the amount of people trying to load it.
If people are truly concerned about the crawlers hammering their 128mb raspberry pi website then a better solution would be to provide an alternative way for scrapers to access the data (e.g., voluntarily contribute a copy of their public site to something like common crawl).
If Anubis blocked crawler requests but helpfully redirected to a giant tarball of every site using their service (with deltas or something to reduce bandwidth), I bet nobody would bother actually spending the time to automate cracking it, since it's basically negative value. You could even make it a torrent so most of the bandwidth costs are paid by random large labs/universities.
I think the real reason most are so obsessed with blocking crawlers is they want “their cut”… an imagined huge check from OpenAI for their fan fiction/technical reports/whatever.
No, this doesn’t work. Many of the affected sites have these but they’re ignored. We’re talking about git forges, arguably the most standardised tool in the industry, where instead of just fetching the repository every single history revision of every single file gets recursively hammered to death. The people spending the VC cash to make the internet unusable right now don’t know how to program. They especially don’t give a shit about being respectful. They just hammer all the sites, all the time, forever.
I'm generally very pro-robot (every web UA is a robot really IMO) but these scrapers are exceptionally poorly written and abusive.
Plenty of organizations managed to crawl the web for decades without knocking things over. There's no reason to behave this way.
It's not clear to me why they've continued to run them like this. It seems so childish and ignorant.
The kind of crawlers/scrapers who DDoS a site like this aren't going to bother checking common crawl or tarballs. You vastly overestimate the intelligence and prosociality of what bursty crawler requests tend to look like. (Anyone who is smart or prosocial will set up their crawler to not overwhelm a site with requests in the first place - yet any site with any kind of popularity gets flooded with these requests sooner or later)
If they don’t have the intelligence to go after the more efficient data collection method then they likely won’t have the intelligence or willpower to work around the second part I mentioned (keeping something like Anubis). The only problem is when you put Anubis in the way of determined, intelligent crawlers without giving them a choice that doesn’t involve breaking Anubis.
> I think the real reason most are so obsessed with blocking crawlers is they want “their cut”…
I find that an unfair view of the situation. Sure, there are examples such as StackOverflow (which is ridiculous enough as they didn't make the content) but the typical use case I've seen on the small scale is "I want to self-host my git repos because M$ has ruined GitHub, but some VC-funded assholes are drowning the server in requests".
They could just clone the git repo, and then pull every n hours, but it requires specialized code so they won't. Why would they? There's no money in maintaining that. And that's true for any positive measure you may imagine until these companies are fined for destroying the commons.
While that’s a reasonable opinion to have, it’s a fight they can’t really win. It’s like putting up a poster in a public square then running up to random people and shouting “no, this poster isn’t for you because I don’t like you, no looking!” Except the person they’re blocking is an unstoppable mega corporation that’s not even morally in the wrong imo (except for when they overburden people’s sites, that’s bad ofc)
literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" https://lock.cmpxchg8b.com/anubis.html Maybe Tavis needs more google-fu. Maybe that includes using DuckDuckGo?
Seems like AI bots are indeed bypassing the challenge by computing it: https://social.anoxinon.de/@Codeberg/115033790447125787