Comment by joecool1029 a year ago

Even some non-profits ignore it now; the Internet Archive stopped respecting it years ago: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

SR2Z a year ago

IA actually has technical and moral reasons to ignore robots.txt. Namely, they want to circumvent this stuff because their goal is to archive EVERYTHING.

prinny_ a year ago

Isn’t this a weak argument? OpenAI could also say their goal is to learn everything, feed it to AI, advance humanity etc etc.

- compootr a year ago
  
  OAI is using others' work to resell it in models. IA uses it to preserve the history of the web.
  There is a case to be made about the value of the traffic you'll get from OAI search, though...
  
- SR2Z a year ago
  
  It does depend a lot on how you feel about IA's integrity :P
  
amarcheschi a year ago

I also don't think they hit servers repeatedly as much.

AnonC a year ago

As I recall, this is outdated information. Internet Archive does respect robots.txt and will remove a site from its archive based on robots.txt. I did this a few years after the blog post you linked, to get an inconsequential site removed from archive.org.

dredmorbius a year ago

The most recent notice IA have blogged was in 2017, and there's no indication that the service has reversed course on robots.txt since.
<https://blog.archive.org/?s=robots.txt>
