Comment by shakna 10 hours ago

When I was writing a crawler for my search engine (now offline), I found that almost no crawler library actually coped with robots files as they exist in the real world. So I ended up going to a lot of effort to write one that complied with Amazon's and Google's rather complicated nested robots files, including respecting the cool-off periods they request.

... And then found their own crawlers can't parse their own manifests.
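
For anyone wondering what "complying" looks like at its simplest, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not the crawler described above. The robots file contents, the `ExampleBot` agent name, and the helpers `load_rules` and `fetch_plan` are all made up for illustration, and it only covers allow/disallow rules plus the `Crawl-delay` cool-off.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def fetch_plan(parser, agent, urls, default_delay=1.0):
    """Return the URLs the agent may fetch and the cool-off between requests.

    Falls back to default_delay (an assumed value) when the file
    does not request a Crawl-delay for this agent.
    """
    delay = parser.crawl_delay(agent)
    if delay is None:
        delay = default_delay
    allowed = [url for url in urls if parser.can_fetch(agent, url)]
    return allowed, float(delay)


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    allowed, delay = fetch_plan(
        rules,
        "ExampleBot",
        ["https://example.com/", "https://example.com/private/report"],
    )
    print(allowed, delay)
```

A real crawler would then sleep for `delay` seconds between requests to the same host. Note that the standard-library parser applies rules in file order (first match wins), whereas RFC 9309 asks for the most specific match, so even this sketch can disagree with how the big crawlers read the same file.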

bb010g 10 hours ago

Could you link the source of your crawler library?

  • shakna 3 hours ago

    It's about 700 lines of the worst Python ever. You do not want it. I would be too embarrassed to release it, honestly.

    It complied, but it was absolutely not fast or efficient. I aimed at compliance first, good code second, but never got to the second because of more human-oriented issues that killed the project.