Comment by gundmc 13 hours ago
What do you mean by "barely" respecting robots.txt? Wouldn't that be more binary? Are they respecting some directives and ignoring others?
Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
This is highly annoying and rude. Is there a complete list of all known bots and crawlers?
Amazonbot doesn't respect the `Crawl-Delay` directive. To be fair, Crawl-Delay is non-standard, but it is claimed to be respected by the other 3 most aggressive crawlers I see.
And how often does it check robots.txt? ClaudeBot will make hundreds of thousands of requests before it re-checks robots.txt to see that you asked it to please stop DDoSing you.
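For reference, here is a rough sketch of what a well-behaved crawler could do about both points, using Python's standard `urllib.robotparser`. The bot name `ExampleBot`, the sample rules, and the helper names are made up for illustration. The 24-hour limit is an assumption based on the caching guidance in RFC 9309, not something any of the crawlers mentioned here is known to implement.

```python
import time
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only (not taken from a real site).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

# Assumption: treat a cached robots.txt as stale after 24 hours.
MAX_AGE_SECONDS = 24 * 60 * 60


def load_rules(text):
    """Parse robots.txt text; parse() also records the time of parsing."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def needs_refresh(parser, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if the rules were never loaded or are older than max_age."""
    now = time.time() if now is None else now
    last_checked = parser.mtime()  # 0 until rules have been parsed
    return last_checked == 0 or (now - last_checked) > max_age


rules = load_rules(ROBOTS_TXT)
delay = rules.crawl_delay("ExampleBot")  # seconds to wait between requests
allowed = rules.can_fetch("ExampleBot", "/private/page.html")
```

A crawler built this way would sleep for `delay` seconds between requests and call `needs_refresh(rules)` before each batch, re-fetching robots.txt whenever it returns `True`.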
Here's Google complaining about pages it wants to index but that I blocked with robots.txt:
New reason preventing your pages from being indexed
Search Console has identified that some pages on your site are not being indexed
due to the following new reason:
Indexed, though blocked by robots.txt
If this reason is not intentional, we recommend that you fix it in order to get
affected pages indexed and appearing on Google.
Open indexing report
Message type: [WNC-20237597]
I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcard user-agent entries.
That counts as barely imho.
I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
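To make the difference concrete, here is a small sketch with Python's standard `urllib.robotparser`. It shows that a compliant parser already applies a wildcard deny-all to a bot that is not named, and what the explicit per-bot workaround looks like. `ExampleAIBot` is a placeholder name, not one of the actual bots, and `is_allowed` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Wildcard deny-all: should cover every crawler that follows the protocol.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

# Workaround: also name the bot explicitly (placeholder name, one group per bot).
WITH_EXPLICIT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path="/"):
    """Return True if a compliant parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


blocked_by_wildcard = not is_allowed(WILDCARD_ONLY, "ExampleAIBot")
blocked_by_explicit = not is_allowed(WITH_EXPLICIT, "ExampleAIBot")
```

Both values come out `True` here, so a crawler that only stops once the explicit group is added is not matching the `*` group the way a standard parser does.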