bakugo 4 days ago

It's more likely that the project itself will disappear into irrelevance as soon as AI scrapers bother implementing the PoW (which is trivial for them, as the post explains) or figure out that they can simply remove "Mozilla" from their user-agent to bypass it entirely.

debugnik 4 days ago

> as AI scrapers bother implementing the PoW

That's what it's for, isn't it? Make crawling slower and more expensive. Shitty crawlers not being able to run the PoW efficiently or at all is just a plus. Although:

> which is trivial for them, as the post explains

Sadly the site's being hugged to death right now so I can't really tell if I'm missing part of your argument here.

> figure out that they can simply remove "Mozilla" from their user-agent

And flag themselves in the logs to get separately blocked or rate limited. Servers win if malicious bots identify themselves again, and forcing them to change the user agent does that.

  • throwawayffffas 4 days ago

    > That's what it's for, isn't it? Make crawling slower and more expensive.

    The default settings produce a computational cost of milliseconds for a week of access. For this to be relevant, it would have to be significantly more expensive, to the point where it would interfere with human access.

    • mfost 3 days ago

      I thought the point (which the article misses) is that a token gives you an identity, and an identity can be tracked and rate limited.

      So a crawler that behaves ethically and puts very little strain on the server should indeed be able to crawl for a whole week on cheap compute; one that hammers the server hard will not.
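
      A minimal sketch of that idea, assuming a server keys its rate limiter on the issued token rather than the IP (this is an illustration, not Anubis's actual implementation; the window and limit values are made up):

      ```python
      # Hypothetical per-token rate limiter: a well-behaved crawler keeps its
      # week-long token, while a hammering one gets throttled or re-challenged.
      import time
      from collections import defaultdict, deque

      WINDOW_SECONDS = 60
      MAX_REQUESTS_PER_WINDOW = 120   # generous for humans, tight for bulk crawling

      _history = defaultdict(deque)   # token -> recent request timestamps

      def allow(token: str) -> bool:
          """Return True if this token may make another request right now."""
          now = time.monotonic()
          q = _history[token]
          while q and now - q[0] > WINDOW_SECONDS:
              q.popleft()             # drop timestamps outside the sliding window
          if len(q) >= MAX_REQUESTS_PER_WINDOW:
              return False            # throttle, or force a fresh challenge
          q.append(now)
          return True
      ```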

      • throwawayffffas 2 days ago

        Sure, but it's really cheap to mint new identities; each node in their scraping cluster can mint hundreds of thousands of tokens per second.

        Provisioning new IPs is probably more costly than calculating the tokens, at least with the default difficulty setting.
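
        For a sense of scale, here is a rough sketch of the kind of solver a scraper would run, assuming an Anubis-style challenge of the form "find a nonce so that SHA-256(challenge + nonce) has N leading zero bits" (simplified for illustration; not the project's exact protocol):

        ```python
        # Simplified SHA-256 proof-of-work solver; not Anubis's exact wire format.
        import hashlib

        def mint_token(challenge: bytes, difficulty_bits: int = 16) -> int:
            """Expected work: 2**difficulty_bits hashes per token."""
            target = 1 << (256 - difficulty_bits)
            nonce = 0
            while True:
                digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
                if int.from_bytes(digest, "big") < target:
                    return nonce
                nonce += 1

        # At a few million hashes per second per core, the expected
        # 2**16 = 65536 hashes take on the order of tens of milliseconds.
        ```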

    • seba_dos1 3 days ago

      ...unless you're sus, in which case the difficulty increases. And if you unleash a single scraping bot, you're not a problem anyway. It's for botnets of thousands, mimicking browsers on residential connections to make them hard to filter out or rate limit, effectively DDoSing the server.

      Perhaps you just don't realize how much the scraping load has increased in the last two years or so. If your server can stay up after deploying Anubis, you've already won.

      • dale_glass 3 days ago

        How is it going to hurt those?

        If it's an actual botnet, then it's hijacked computers belonging to other people, who are the ones paying the power bills. The attacker doesn't care that each computer takes a long time to calculate. If you have 1000 computers each spending 5s/page, then your botnet can retrieve 200 pages/s.

        If it's a cloud deployment, it still has resources that vastly outstrip a normal person's.

        The fundamental issue is that you can't serve example.com slower than a legitimate user on a crappy 10-year-old laptop could tolerate, because that starts losing you real human users. So if, let's say, a user is happy to wait 5 seconds per page at most, then this is absolutely no obstacle to a modern 128-core Epyc. If you make it troublesome for the 128-core monster, then no normal person will find the site usable.
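
        The throughput arithmetic, using the figures above (illustrative numbers from this comment, not measurements):

        ```python
        # Back-of-the-envelope throughput for a botnet paying the PoW
        # on hijacked hardware whose owners foot the power bill.
        bots = 1000                      # compromised machines
        seconds_per_page = 5             # roughly the longest delay a human tolerates
        print(bots / seconds_per_page)   # 200.0 pages per second across the botnet
        ```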

  • shkkmo 4 days ago

    The explanation of how the estimate is made is more detailed, but here is the referenced conclusion:

    >> So (11508 websites * 2^16 sha256 operations) / 2^21, that’s about 6 minutes to mine enough tokens for every single Anubis deployment in the world. That means the cost of unrestricted crawler access to the internet for a week is approximately $0.

    >> In fact, I don’t think we reach a single cent per month in compute costs until several million sites have deployed Anubis.
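
    Reproducing that back-of-the-envelope figure with the article's own numbers (2^16 expected hashes per default challenge, and roughly 2^21 hashes per second measured on a free-tier cloud instance):

    ```python
    # Numbers taken from the quoted article, not independently measured.
    sites = 11508                   # public Anubis deployments cited in the article
    hashes_per_token = 2 ** 16      # expected work at the default difficulty
    hash_rate = 2 ** 21             # hashes/second on a free-tier cloud VM

    seconds = sites * hashes_per_token / hash_rate
    print(seconds / 60)             # ~6 minutes for a week-long token on every site
    ```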

    • kbelder 3 days ago

      If you use one solution to browse the entire site, you're linking every pageload to the same session, and can then be easily singled out and blocked. The idea that you can scan a site for a week by solving the riddle once is incorrect; that only works for non-abusers.

      • shkkmo 2 days ago

        Well, since they can get a unique token for every site every 6 minutes using only a free GCP VPS, that doesn't really matter: scraping can easily be spread out across tokens, or they can cheaply and quickly get a new one whenever the old one gets blocked.

    • hiccuphippo 4 days ago

      Wasn't SHA-256 designed to be very fast to compute? They should be using bcrypt or something similar.

      • throwawayffffas 4 days ago

        Unless they require a new token for each request, or every x minutes or something, it won't matter.

        And as the poster mentioned, if you are running an AI model you probably have GPUs to spare, unlike the dev working from a 5-year-old ThinkPad or their phone.

        • _flux 3 days ago

          Apparently bcrypt's design makes it difficult to accelerate effectively on a GPU.

          Indeed, a new token should be required per request; the tokens could also be pre-calculated, so that while the user is browsing a page, the browser could compute tokens for the next likely browsing targets (e.g. the "next" button).

          The biggest downside I see is that mobile devices would likely suffer. Possibly the difficulty of the challenge is, or should be, varied by other metrics, such as the number of requests arriving per unit of time from a class C network, etc.
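
          One way that last idea could look, purely as an illustration (Anubis's actual heuristics aren't described here, and the thresholds are made up): scale the required leading-zero bits with the request rate recently seen from a client's /24.

          ```python
          # Illustrative only: harder challenges for networks that hammer the server.
          def difficulty_for(requests_last_minute_from_slash24: int) -> int:
              base_bits = 16                  # cheap for an occasional human visitor
              if requests_last_minute_from_slash24 > 1000:
                  return base_bits + 8        # 256x more expected work
              if requests_last_minute_from_slash24 > 100:
                  return base_bits + 4        # 16x more expected work
              return base_bits
          ```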

    • debugnik 4 days ago

      That's a matter of increasing the difficulty, isn't it? And if the added cost is really negligible, we can just switch to a "refresh" challenge for the same added latency, without burning energy for no reason.

      • Retr0id 4 days ago

        If you increase the difficulty much beyond what it currently is, legitimate users end up having to wait for ages.

      • therein 4 days ago

        I am guessing you don't realize that this means people who aren't using the latest generation of phones will suffer.

        • debugnik 3 days ago

          I'm not using the latest generation of phones, not in the slightest, and I don't really care, because the alternative to Anubis-like interstitials is sites not loading at all when they're mass-crawled to death.

skydhash 4 days ago

It's more about the (intentional?) DDoS from AI scrapers than about preventing them from accessing the content. Bandwidth is not cheap.

unclad5968 4 days ago

I'm not on Firefox or any Firefox derivative, and I still get anime cat girls making sure I'm not a bot.

  • nemomarx 4 days ago

    Mozilla appears in the user agent string of all major browsers for historical reasons, but not necessarily in headless clients and the like.

dingnuts 4 days ago

[flagged]

  • verteu 4 days ago

    > PoW increases the cost for the bots which is great. Trivial to implement, sure, but that added cost will add up quickly.

    No, the article estimates it would cost less than a single penny to scrape all pages of 1,000,000 distinct Anubis-guarded websites for an entire month.

    • thunderfork 4 days ago

      Once you've built the system that lets you do that, maybe. You still have to do that, though, so it's still raising the cost floor.

      • vmttmv 3 days ago

        But... how? When the author ran the numbers, the rough estimate was solving challenges at a rate of about 10,000 per 5 minutes on a single instance of the free tier of Google Compute. That is an insignificant load at an even more insignificant cost.

  • userbinator 3 days ago

    I thought HN was anti-copyright and anti-imaginary-property, or at least the bulk of its users were. Yet all of a sudden, "but AI!!!!1"?

    > a federal crime

    The rest of the world doesn't care.

    • klabb3 3 days ago

      > I thought HN was anti-copyright

      Maybe. But what's happening is "copyright for thee, not for me", not a universal relaxation of copyright. This loophole exploitation by behemoths doesn't advance any ideological goals; it only inflames the situation, because now you have an adversarial topology. You can see this clearly in practice: more and more resources are going into defense and protection of data than ever before. Fingerprinting, captchas, paywalls, login walls, etc.

  • altairprime 4 days ago

    Don’t forget signed attestations from “user probably has skin in the game” cloud providers like iCloud (already live in Safari and accepted by Cloudflare, iirc?) — not because they identify you but because abusive behavior will trigger attestation provider rate limiting and termination of services (which, in Apple’s case, includes potentially a console kill for the associated hardware). It’s not very popular to discuss at HN but I bet Anubis could add support for it regardless :)

    https://datatracker.ietf.org/wg/privacypass/about/

    https://www.w3.org/TR/vc-overview/

  • shkkmo 4 days ago

    > PoW increases the cost for the bots which is great.

    But not by any meaningful amount, as explained in the article. All it actually does is rely on its obscurity while interfering with legitimate use.

  • nialv7 4 days ago

    > Fuck AI scrapers, and fuck all this copyright infringement at scale.

    Yes, fuck them. The problem is that Anubis isn't doing the job here. As the article already explains, Anubis currently adds not a single cent to the AI scrapers' costs. For Anubis to become effective against scrapers, it will necessarily have to become quite annoying for legitimate users.

    • Gibbon1 4 days ago

      The best response to AI scrapers is to poison their models.

      • nemomarx 4 days ago

        how well is modern poisoning holding up?