Comment by canu7
If they need to query a trained LLM for each page they crawl, I would guess that the training cost would scale up pretty badly...
Of course you wouldn't do it for every single page. If I were designing this crawler, I'd have it sample a percentage of pages: start at a 100% sample rate for a completely unknown website, then decrease the rate over time as more "good" pages are found relative to "bad" ones.
After a "good" page percentage threshold is exceeded, stop sampling entirely and just crawl, assuming that all content is good. After a "bad" page percentage threshold is exceeded just stop wasting your time crawling that domain entirely.
With modern models the sampling cost should be quite cheap, especially since Nepenthes has a really small page size. If the pages were humongous, though, that would make them harder and more expensive to put through an LLM.