Comment by unsnap_biceps 13 hours ago

10 replies

I believe that a number of AI bots only respect robots.txt entries that explicitly name their static user agent. They ignore wildcard user-agent entries.

That counts as barely respecting it, imho.

I found this out after OpenAI was decimating my site and ignoring the wildcard deny-all. I had to add entries specifically for their three bots to get them to stop.
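For anyone who wants to check their own file, here is a rough Python sketch of the difference between a wildcard-only robots.txt and one with explicit per-bot entries. It assumes the "three bots" are GPTBot, ChatGPT-User and OAI-SearchBot, which are the user agents OpenAI has documented; the comment above does not name them. `has_explicit_entry` and `compliant_can_fetch` are hypothetical helper names, and example.com is a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# Assumption: these are the three OpenAI user agents the comment refers to.
OPENAI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot"]

# A deny-all that relies only on the wildcard group.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /
"""

# The same policy, but with each bot also named explicitly.
EXPLICIT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: *
Disallow: /
"""


def has_explicit_entry(robots_txt: str, bot_name: str) -> bool:
    """Return True if a User-agent line names bot_name exactly (not just '*')."""
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            if value.strip().lower() == bot_name.lower():
                return True
    return False


def compliant_can_fetch(robots_txt: str, bot_name: str,
                        url: str = "https://example.com/page") -> bool:
    """What a standards-following parser (Python's stdlib one) would decide."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_name, url)


if __name__ == "__main__":
    for label, text in [("wildcard only", WILDCARD_ONLY), ("explicit", EXPLICIT)]:
        for bot in OPENAI_BOTS:
            print(label, bot,
                  "explicit entry:", has_explicit_entry(text, bot),
                  "compliant parser allows:", compliant_can_fetch(text, bot))
```

A parser that follows the standard treats both files as a full block. The explicit entries only matter for crawlers that, as described above, skip the wildcard group and look for their own name.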

joecool1029 12 hours ago

Even some non-profits ignore it now. Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

  • SR2Z 11 hours ago

    IA actually has technical and moral reasons to ignore robots.txt. Namely, they want to circumvent this stuff because their goal is to archive EVERYTHING.

    • prinny_ 9 hours ago

      Isn’t this a weak argument? OpenAI could also say their goal is to learn everything, feed it to AI, advance humanity etc etc.

      • compootr 9 hours ago

        OAI is using others' work to resell it in models. IA uses it to preserve the history of the web.

        there is a case to be made about the value of the traffic you'll get from oai search though...

    • amarcheschi 10 hours ago

      I also don't think they hit servers repeatedly so much

  • AnonC 7 hours ago

    As I recall, this is outdated information. Internet Archive does respect robots.txt and will remove a site from its archive based on robots.txt. I did this a few years after the blog post you linked to get an inconsequential site removed from archive.org.
