Comment by Philpax 4 days ago
Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163
My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.
It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
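To make the mechanism concrete, here's a minimal sketch of a hash-based proof-of-work challenge of the kind Anubis uses; the challenge format, difficulty, and token binding here are illustrative assumptions, not Anubis's actual protocol. The server issues a random challenge, the client brute-forces a nonce whose SHA-256 hash clears a difficulty threshold, and the server verifies the result with a single hash before issuing an IP-bound token.

```python
# Illustrative hash-based proof-of-work, not Anubis's real protocol:
# find a nonce such that sha256(challenge + nonce) has `difficulty`
# leading zero bits. Solving is expensive; verifying is one hash.
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()   # zeros within the first nonzero byte
        break
    return bits

def solve(challenge: bytes, difficulty: int) -> int:
    """Client side (JavaScript in Anubis): brute-force a qualifying nonce."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, vastly cheaper than solving."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

if __name__ == "__main__":
    challenge = os.urandom(16)   # issued by the server, tied to the client IP
    difficulty = 16              # ~65k hashes on average; tune to the cost you want
    nonce = solve(challenge, difficulty)
    assert verify(challenge, nonce, difficulty)
    print(f"solved with nonce {nonce}")
```

Rotating to a new IP invalidates the token, so each fresh address has to pay the solving cost again, while a well-behaved client pays it only once.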
Earlier today I found we'd served over a million requests to over 500,000 different IPs.
All had the same user agent (current Safari); they seem to come from hacked computers, as the ISPs are scattered all over the world.
The structure of the requests almost certainly means we've been specifically targeted.
But it's also a valid query, one that's reasonable for normal users to make.
From this article, it looks like Proof of Work isn't going to be the solution I'd hoped it would be.
The math in the article assumes scrapers only need one Anubis token per site, whereas a scraper using 500,000 IPs would require 500,000 tokens.
Scaling up the math in the article, which states it would take 6 CPU-minutes to generate enough tokens to scrape all 11,508 Anubis-using websites, we're now looking at 4.3 CPU-hours to obtain enough tokens to scrape your website (and 50,000 CPU-hours to scrape the Internet). That still isn't much: at cloud VM prices it's around 10c to crawl your website and $1000 to crawl the Internet, which doesn't seem like a lot, but it's much better than "too low to even measure".
However, the article observes that Anubis's default difficulty can be solved in 30ms on a single-core server CPU. That seems unreasonably low to me; I would expect something closer to a second to be an appropriate difficulty. Perhaps the server benefits from hardware-accelerated SHA-256, whereas Anubis has to stay fast enough on clients without it? If the JavaScript PoW implementation could be brought closer to parity with a server CPU (maybe using a hash function designed to be expensive and hard to accelerate, rather than one designed to be cheap and easy to accelerate), that would bring the cost of obtaining 500k tokens up to 138 CPU-hours -- about $2-3 to crawl one site, or around $30,000 to crawl all Anubis deployments.
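For reference, here's a small script that reproduces the arithmetic above; the 30ms and 1-second per-token times, the 500,000-IP and 11,508-site counts, and the roughly $0.02-per-CPU-hour price are assumptions taken from this thread rather than measured values.

```python
# Back-of-the-envelope cost of obtaining one Anubis token per IP.
SECONDS_PER_TOKEN_DEFAULT = 0.030   # ~30ms at the default difficulty (per the article)
SECONDS_PER_TOKEN_HARDER  = 1.0     # hypothetical tuned difficulty
IPS_PER_SITE     = 500_000          # rotating IPs observed hitting one site
ANUBIS_SITES     = 11_508           # Anubis deployments cited in the article
USD_PER_CPU_HOUR = 0.02             # rough cloud spot price, an assumption

def cpu_hours(seconds_per_token: float, tokens: int) -> float:
    return seconds_per_token * tokens / 3600

one_site_default  = cpu_hours(SECONDS_PER_TOKEN_DEFAULT, IPS_PER_SITE)   # ~4.2 h
all_sites_default = one_site_default * ANUBIS_SITES                      # ~48,000 h
one_site_harder   = cpu_hours(SECONDS_PER_TOKEN_HARDER, IPS_PER_SITE)    # ~139 h

print(f"one site, default difficulty: {one_site_default:.1f} CPU-hours "
      f"(~${one_site_default * USD_PER_CPU_HOUR:.2f})")
print(f"all Anubis sites, default:    {all_sites_default:,.0f} CPU-hours "
      f"(~${all_sites_default * USD_PER_CPU_HOUR:,.0f})")
print(f"one site, 1s per token:       {one_site_harder:.0f} CPU-hours "
      f"(~${one_site_harder * USD_PER_CPU_HOUR:.2f})")
```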
I'm somewhat skeptical of the idea of Anubis -- that cost still might be way too low, especially given the billions of VC dollars thrown at any company with "AI" in their sales pitch -- but I think the article is overly pessimistic. If your goal is not to stop scrapers, but rather to incentivize scrapers to be respectful by making it cheaper to abide by rate limits than it is to circumvent them, maybe Anubis (or something like it) really is enough.
(Although if it's true that AI companies really are using botnets of hacked computers, then Anubis is totally useless against bots smart enough to solve the challenges since the bots aren't paying for the CPU time.)
If a scraper works from a small number of IPs, it's easy to block or rate-limit. Rate limits against that behaviour are fairly easy to implement, as are limits on non-human user agents, which is why these botnets present browser user agents.
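For illustration, per-IP rate limiting of this kind is essentially a token bucket keyed on the client address; the sketch below uses made-up rate and burst numbers and an in-memory dict (a real deployment would more likely use something like nginx's limit_req or a shared store). A botnet rotating through 500,000 addresses never exhausts any single bucket, which is exactly why rotation defeats it.

```python
# Minimal per-IP token-bucket rate limiter. Rate and burst values are
# arbitrary illustrative numbers, not a recommendation.
import time
from collections import defaultdict

class PerIPRateLimiter:
    def __init__(self, rate_per_second: float = 2.0, burst: int = 20):
        self.rate = rate_per_second
        self.burst = burst
        # ip -> (tokens remaining, timestamp of last refill)
        self.buckets: dict[str, tuple[float, float]] = defaultdict(
            lambda: (float(burst), time.monotonic())
        )

    def allow(self, ip: str) -> bool:
        tokens, last = self.buckets[ip]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            self.buckets[ip] = (tokens, now)
            return False          # over budget: reject or challenge this request
        self.buckets[ip] = (tokens - 1.0, now)
        return True

limiter = PerIPRateLimiter()
print(limiter.allow("203.0.113.7"))   # True until the budget is exhausted
```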
The Duke University Library analysis posted elsewhere in the discussion is promising.
I'm certain the botnets are using hacked or malware-infected computers, as the vast majority of requests come from residential ISPs and small hosting providers. Malware is probably the more common route, e.g. a program that streams pirated TV, or a "free" VPN app, which quietly joins the user's device to a botnet.
Why haven't they been sued and jailed for DDoS, which is a felony?
Criminal convictions in the US require proof "beyond a reasonable doubt", and I suspect cases like this would not pass the required mens rea test: in their minds at least (and probably a judge's), there was no intent to cause a denial of service. Trying to argue otherwise on technical grounds (e.g. "most servers cannot handle this load and they somehow knew it") is, in my opinion, unlikely to sway the court, especially considering that web scraping has already been ruled legal and that a ToS clause forbidding it cannot be legally enforced.
Coming from a different legal system, so please forgive my ignorance: Is it necessary in the US to prove ill intent in order to sue for damages? Just wondering, because when I accidentally punch someone's tooth out, I would assume they are certainly entitled to the dentist bill.
>Is it necessary in the US to prove ill intent in order to sue for damages?
As a general rule of thumb: you can sue anyone for anything in the US. There are even a few cases where someone tried to sue God: https://en.wikipedia.org/wiki/Lawsuits_against_supernatural_...
When we say "do we need to" or "can we", we're really asking how plausible it is to win the case. A lawyer won't take a case with bad odds of winning, even if you offer to pay extra, because part of their reputation rests on taking battles they believe they can win.
>because when I accidentally punch someone's tooth out, I would assume they are certainly entitled to the dentist bill.
IANAL, so the boring answer is "it depends". Damages aren't guaranteed, and there are 50 different sets of state laws to consider on top of federal law.
Generally, the injured party is not automatically entitled to have you cover the damages, but you may well be charged with battery. Intent will be a strong factor in how the case goes.
There's an angle where criminal intent doesn't matter: negligence and damages. They had to have known that their scrapers would cause denial of service, unauthorized access, increased costs for operators, etc.
That's not a certain outcome. If you're willing to take this case, I can provide access logs and any evidence you want. You can keep any money you win, plus I'll pay a bonus on top! Wanna do it?
Keep in mind I'm in Germany, the server is in another EU country, and the worst scrapers are overseas (in China, the USA, and Singapore). Thanks to these LLMs there's no barrier to having the relevant laws translated in all directions, so I trust that won't be a problem! :P
> criminal intent doesn't matter when it comes to negligence and damages
Are you a criminal defense attorney or prosecutor?
> They had to have known
IMO good luck convincing a judge of that... especially "beyond a reasonable doubt" as would be required for criminal negligence. They could argue lots of other scrapers operate just fine without causing problems, and that they tested theirs on other sites without issue.
I thought only capital crimes (murder, for example) required proof beyond a reasonable doubt, and that lesser crimes required either a "preponderance of evidence" or "clear and convincing evidence" as the burden of proof.
Still, even by those lesser standards, it's hard to build a case.
It's civil cases that have the lower standard of proof. Civil cases arise when one party sues another, typically seeking money, and they are claims in equity, where the defendant is alleged to have harmed the plaintiff in some way.
Criminal cases require proof beyond a reasonable doubt. Most things that can result in jail time are criminal cases. Criminal cases are almost always brought by the government, and criminal acts are considered harm to society rather than to (strictly) an individual. In the US, criminal cases are classified as "misdemeanors" or "felonies," but that language is not universal in other jurisdictions.
No, all criminal convictions require proof beyond a reasonable doubt: https://constitution.congress.gov/browse/essay/amdt14-S1-5-5...
>Absent a guilty plea, the Due Process Clause requires proof beyond a reasonable doubt before a person may be convicted of a crime.
Why not just actually rate-limit everyone, instead of slowing them down with proof-of-work?