trebor 14 hours ago

Upvoted because we’re seeing the same behavior from all AI and SEO bots. They’re BARELY respecting robots.txt, and they’re hard to block. And when they crawl, they spam requests and drive load so high that they crash many of our clients’ servers.

If AI crawlers want access, they can either behave or pay. The consequence will be almost universal blocks otherwise!

herpdyderp 13 hours ago

> The consequence will be almost universal blocks otherwise!

How? The difficulty of doing that is the problem, isn't it? (Otherwise we'd just be doing that already.)

  • ADeerAppeared 11 hours ago

    > (Otherwise we'd just be doing that already.)

    Not quite what the original commenter meant but: WE ARE.

    A major consequence of this reckless AI scraping is that it turbocharged the move away from the web and into closed ecosystems like Discord. Away from the prying eyes of most AI scrapers ... and the search engine indexes that made the internet so useful as an information resource.

    Lots of old websites & forums are going offline as their hosts either cannot cope with the load or send a sizeable bill to the webmaster who then pulls the plug.

gundmc 13 hours ago

What do you mean by "barely" respecting robots.txt? Wouldn't that be more binary? Are they respecting some directives and ignoring others?

  • unsnap_biceps 13 hours ago

    I believe that a number of AI bots only respect robots.txt entries that explicitly name their user agent string; they ignore wildcard user agents.

    That counts as barely imho.

    I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
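
    For anyone hitting the same thing, here is roughly what that ends up looking like in robots.txt. This is only a sketch: it assumes the three bots meant are GPTBot, ChatGPT-User, and OAI-SearchBot, the crawler names OpenAI documents.

        # Wildcard rule that some AI crawlers apparently ignore
        User-agent: *
        Disallow: /

        # Explicit per-bot entries added until the crawling stopped
        User-agent: GPTBot
        Disallow: /

        User-agent: ChatGPT-User
        Disallow: /

        User-agent: OAI-SearchBot
        Disallow: /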

  • LukeShu 12 hours ago

    Amazonbot doesn't respect the `Crawl-Delay` directive. To be fair, Crawl-Delay is non-standard, but it is claimed to be respected by the other 3 most aggressive crawlers I see.

    And how often does it check robots.txt? ClaudeBot will make hundreds of thousands of requests before it re-checks robots.txt to see that you asked it to please stop DDoSing you.
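
    For reference, the Crawl-Delay directive mentioned above looks like this in robots.txt; since it is non-standard, whether the value is honored at all, and whether it is read as seconds, depends entirely on the crawler (the 10 below is an arbitrary example):

        User-agent: Amazonbot
        Crawl-delay: 10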

  • Animats 11 hours ago

    Here's Google, complaining of problems with pages they want to index but I blocked with robots.txt.

        New reason preventing your pages from being indexed
    
        Search Console has identified that some pages on your site are not being indexed 
        due to the following new reason:
    
            Indexed, though blocked by robots.txt
    
        If this reason is not intentional, we recommend that you fix it in order to get
        affected pages indexed and appearing on Google.
        Open indexing report
        Message type: [WNC-20237597]

ksec 13 hours ago

Is there some way websites can sell that data to AI bots as one large zip file rather than being constantly DDoSed?

Or they could at least have the courtesy to scrape during night time / off-peak hours.

  • jsheard 13 hours ago

    No, because they won't pay for anything they can get for free. There's only one situation where an AI company will pay for data, and that's when it's owned by someone with scary enough lawyers to pressure them into paying up. Hence why OpenAI has struck licensing deals with a handful of companies while continuing to bulk-scrape unlicensed data from everyone else.

  • seethenerdz 4 hours ago

    Is existing intellectual property law not sufficient? Why aren't companies being prosecuted for large-scale theft?

Vampiero 13 hours ago

> The consequence will be almost universal blocks otherwise!

Who cares? They've already scraped the content by then.

  • jsheard 13 hours ago

    Bold to assume that an AI scraper won't come back to download everything again, just in case there's any new scraps of data to extract. OP mentioned in the other thread that this bot had pulled 3TB so far, and I doubt their git server actually has 3TB of unique data, so the bot is probably pulling the same data over and over again.

    • xena 13 hours ago

      FWIW that includes other scrapers; Amazon's is just the one that showed up the most in the logs.

  • _heimdall 12 hours ago

    If they only needed a one-time scrape, we really wouldn't be seeing noticeable bot traffic today.

emmelaich 9 hours ago

If they're AI bots it might be fun to feed them nonsense. Just send back megabytes of "Bezos is a bozo" or something like that. Even more fun if you could cooperate with many other otherwise-unrelated websites, e.g. via time settings in a modified tarpit.
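
A minimal sketch of that idea in Python; the user-agent substrings, port, and response size are all made up for illustration, and a real deployment would sit behind (or inside) the actual web server:

    # Toy "feed the bots nonsense" server: known AI crawler user agents get
    # a pile of filler text, everyone else gets a 404 placeholder.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BOT_MARKERS = ("GPTBot", "ClaudeBot", "Amazonbot", "Bytespider")  # assumed UA substrings
    FILLER = ("Bezos is a bozo. " * 64).encode()  # ~1 KiB chunk of nonsense

    class NonsenseHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            if any(marker in ua for marker in BOT_MARKERS):
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                for _ in range(1024):      # roughly 1 MiB of junk per request
                    self.wfile.write(FILLER)
            else:
                self.send_response(404)    # real content would be served here
                self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), NonsenseHandler).serve_forever()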

mschuster91 13 hours ago

Global tarpit is the solution. It makes sense anyway, even without taking AI crawlers into account. Back when I had to implement that, I went the semi-manual route: parse the access log, and any IP address averaging more than X hits a second on /api gets a -j TARPIT with iptables [1] (rough sketch below).

Not sure how to implement it in the cloud though, never had the need for that there yet.

[1] https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62f...
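
A rough sketch of that semi-manual approach (log path, log format, and thresholds are assumptions; a real version would only count recent log lines, and the TARPIT target requires the xtables-addons module):

    # Count /api hits per client IP in an access-log slice and tarpit the noisy ones.
    # Assumed log format: nginx/Apache combined ('IP - - [time] "GET /path ..." ...').
    import re
    import subprocess
    from collections import Counter

    LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
    WINDOW_SECONDS = 60                     # span of traffic the log slice covers
    MAX_HITS_PER_SEC = 5                    # the "X hits a second" threshold, made up here

    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[A-Z]+ (\S+)')

    hits = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            m = line_re.match(line)
            if m and m.group(2).startswith("/api"):
                hits[m.group(1)] += 1  # group 1 is the client IP, group 2 the path

    for ip, count in hits.items():
        if count / WINDOW_SECONDS > MAX_HITS_PER_SEC:
            # TARPIT accepts the TCP connection and then ignores it, tying up the
            # client's resources instead of simply dropping its packets.
            subprocess.run(
                ["iptables", "-A", "INPUT", "-s", ip, "-p", "tcp",
                 "--dport", "443", "-j", "TARPIT"],
                check=False,
            )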

  • jks 13 hours ago

    One such tarpit (Nepenthes) was just recently mentioned on Hacker News: https://news.ycombinator.com/item?id=42725147

    Their site is down at the moment, but luckily they haven't stopped Wayback Machine from crawling it: https://web.archive.org/web/20250117030633/https://zadzmo.or...

    • marcus0x62 12 hours ago

      Quixotic[0] (my content obfuscator) includes a tarpit component, but for something like this, I think the main quixotic tool would be better: you run it against your content once, and it generates a pre-obfuscated version of it. That takes a lot less of your resources to serve than dynamically generating the tarpit links and content.

      0 - https://marcusb.org/hacks/quixotic.html

    • kazinator 12 hours ago

      How do you know their site is down? You probably just hit their tarpit. :)

  • bwfan123 12 hours ago

    I would think public outcry by influencers on social media (such as this thread) is a better deterrent, and it also establishes a public data point and exhibit for future reference, as it is hard to scale the tarpit.

  • seethenerdz 4 hours ago

    Don't we have intellectual property law for this tho?

  • idlewords 11 hours ago

    This doesn't work with the kind of highly distributed crawling that is the problem now.

seethenerdz 4 hours ago

Don't worry, though, because IP law only applies to peons like you and me. :)