Comment by eqvinox
TFA — and most comments here — seem to completely miss what I thought was the main point of Anubis: it counters the crawler's "identity scattering"/sybil'ing/parallel crawling.
Any access will fall into one of the following categories:
- Clients with JS and cookies: the server now has an identity (the cookie) to apply rate limiting to. Humans should never hit the limit, but crawlers will be slowed down immensely or ejected. Of course the identity can be rotated, but only at the cost of solving the puzzle again.
- Amnesiac clients with JS but no cookies: every access now means re-solving the puzzle, so each request is expensive.
- (No JS: no access at all.)
The point is to prevent parallel crawling from overloading the server. Crawlers can still start an arbitrary number of parallel crawls, but each one costs a puzzle solve to start and then has to stay under the per-identity rate limit. Previously, the server would simply collapse under thousands of crawler requests per second; that is what Anubis makes prohibitively expensive.
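To make that flow concrete, here is a rough Go sketch of the idea, not Anubis's actual code: the cookie name `anubis_id`, the fixed difficulty, the static challenge string, and the 10 req/s window are all invented for illustration, and a real deployment would sign the cookie and bind the challenge to the client rather than trust a bare value. The only point is that minting a new identity costs a hash puzzle, while an existing identity is rate limited.

```go
// Hypothetical sketch of "pay a puzzle to get an identity, then get rate
// limited per identity". Not the real Anubis implementation.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"strings"
	"sync"
	"time"
)

const difficulty = 4 // required leading zero hex chars; illustrative only

var (
	mu       sync.Mutex
	lastSeen = map[string][]time.Time{} // cookie value -> recent request times
)

// solved reports whether sha256(challenge+nonce) starts with enough zeros.
func solved(challenge, nonce string) bool {
	sum := sha256.Sum256([]byte(challenge + nonce))
	return strings.HasPrefix(hex.EncodeToString(sum[:]), strings.Repeat("0", difficulty))
}

// allow enforces a crude sliding-window limit of 10 requests/second per identity.
func allow(id string) bool {
	mu.Lock()
	defer mu.Unlock()
	cutoff := time.Now().Add(-time.Second)
	recent := lastSeen[id][:0]
	for _, t := range lastSeen[id] {
		if t.After(cutoff) {
			recent = append(recent, t)
		}
	}
	if len(recent) >= 10 {
		lastSeen[id] = recent
		return false
	}
	lastSeen[id] = append(recent, time.Now())
	return true
}

func handler(w http.ResponseWriter, r *http.Request) {
	c, err := r.Cookie("anubis_id")
	if err != nil {
		// No identity yet: issue one only for a valid proof-of-work answer
		// (normally computed by the JS challenge page in the browser).
		nonce := r.URL.Query().Get("nonce")
		if nonce == "" || !solved("server-issued-challenge", nonce) {
			http.Error(w, "solve the challenge first", http.StatusForbidden)
			return
		}
		sum := sha256.Sum256([]byte("id:" + nonce))
		http.SetCookie(w, &http.Cookie{Name: "anubis_id", Value: hex.EncodeToString(sum[:])[:16]})
		fmt.Fprintln(w, "identity issued; retry with the cookie")
		return
	}
	// Known identity: dropping the cookie sends the client back through the
	// puzzle branch above, so rotating identities is never free.
	if !allow(c.Value) {
		http.Error(w, "rate limited", http.StatusTooManyRequests)
		return
	}
	fmt.Fprintln(w, "content")
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```

Under that model, a crawler that wants N parallel identities pays N puzzle solves up front and is still capped at N times the per-identity rate, which is exactly the cost/throughput trade-off being argued about here.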
Yes, I think you're right. The commentary about its (presumed, imagined) effectiveness is very much making the assumption that it's designed to be an impenetrable wall[0] -- i.e. prevent bots from accessing the content entirely.
I think TFA is generally quite good and has something of a good point about the economics of the situation, but finding that the math shakes out that way should, perhaps, lead one to question their starting point / assumptions[1].
In other words, who said the websites in question wanted to entirely prevent crawlers from accessing them? The answer is: no one. Web crawlers are and have been fundamental to accessing the web for decades. So why are we talking about trying to do that?
[0] Mentioning 'impenetrable wall' is probably setting off alarm bells, because of course that would be a bad design.
[1] (Edited to add:) I should say 'to question their assumptions more' -- like I said, the article is quite good and it does present this as confusing, at least.