Comment by NathanKP

Comment by NathanKP 9 months ago

4 replies

This looks extremely easy to detect and filter out. For example: https://i.imgur.com/hpMrLFT.png

In short, the creator may think this will trick AI web crawlers, but in reality it would take about 5 minutes to write a simple check that flags the site and bans it from being crawled. With modern LLM workflows it's actually fairly simple and cheap to burn just a little bit of GPU time to check if the data you are crawling is decent.

Only a really, really bad crawl bot would fall for this. The funny thing is that in order to make something that an AI crawler bot would actually fall for, you'd have to use LLMs to generate realistic enough looking content. A Markov chain isn't going to cut it.

slongfield 9 months ago

The most annoying bots are the ones that mindlessly slam sites over and over, without doing any filtering. Having these kinds of tarpits out in the wild forcing people to be better behaved with their crawling bots is a feature, not a bug.

canu7 9 months ago

If they need to query a trained LLM for each page they crawl, I would guess that the training cost would scale up pretty badly...

  • NathanKP 9 months ago

    Of course you wouldn't do it for every single page. If I were designing this crawler, I'd make it sample a percentage of pages, starting at a 100% sample rate for a completely unknown website and decreasing the sample rate over time as more "good" pages are found relative to "bad" pages.

    After a "good" page percentage threshold is exceeded, stop sampling entirely and just crawl, assuming that all content is good. After a "bad" page percentage threshold is exceeded just stop wasting your time crawling that domain entirely.

    With modern models the sampling should be quite cheap, especially since Nepenthes has a really small page size. Now, if the pages were humongous, that might make them harder and more expensive to put through an LLM.
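
A rough sketch of the sampling scheme described above, in Python. The threshold values, the `min_pages` floor, the decay formula in `sample_rate`, and the `is_good_page` callback (standing in for whatever LLM check the crawler would run) are all assumptions made for illustration; none of them come from the comment itself.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class DomainSampler:
    """Per-domain bookkeeping for how often pages get quality-checked."""
    good_threshold: float = 0.9  # assumed: share of good pages that earns trust
    bad_threshold: float = 0.5   # assumed: share of bad pages that earns a ban
    min_pages: int = 20          # assumed: checks required before any verdict
    decay: int = 10              # assumed: larger values slow the rate drop
    good: int = 0
    bad: int = 0
    rng: random.Random = field(default_factory=random.Random)

    @property
    def checked(self) -> int:
        return self.good + self.bad

    def status(self) -> str:
        """Return 'sampling', 'trusted', or 'banned' for this domain."""
        if self.checked < self.min_pages:
            return "sampling"
        if self.bad / self.checked >= self.bad_threshold:
            return "banned"
        if self.good / self.checked >= self.good_threshold:
            return "trusted"
        return "sampling"

    def sample_rate(self) -> float:
        """1.0 for an unknown domain, shrinking as good pages accumulate."""
        return 1.0 - self.good / (self.checked + self.decay)

    def should_check(self) -> bool:
        """Decide whether the next page goes through the quality check."""
        return self.status() == "sampling" and self.rng.random() < self.sample_rate()

    def record(self, is_good: bool) -> None:
        if is_good:
            self.good += 1
        else:
            self.bad += 1


def crawl(pages: Iterable[str],
          is_good_page: Callable[[str], bool],
          sampler: DomainSampler) -> List[str]:
    """Keep pages from one domain, checking only a sampled subset."""
    kept: List[str] = []
    for page in pages:
        if sampler.status() == "banned":
            break  # stop spending time on this domain
        if sampler.should_check():
            ok = is_good_page(page)  # hypothetical LLM-backed check
            sampler.record(ok)
            if not ok:
                continue  # drop pages that fail the check
        kept.append(page)
    return kept
```

With these defaults, a domain that serves only junk is checked on every page (the rate never drops without good pages) and is abandoned after 20 checks, while a clean domain is checked less and less often and stops being checked at all once 20 sampled pages have passed.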

    • krior 9 months ago

      > After a "bad" page percentage threshold is exceeded just stop wasting your time crawling that domain entirely.

      In the words of Bush Jr.: Mission accomplished!