Comment by akst 2 days ago

Sympathies to the author; it sounds like he's talking about crawlers, although I do write scrapers from time to time. I'm probably not the type of person to scrape his blog, since it sounds like he's gone to lengths to make it useful, but if I've resorted to scraping something it's usually because I never found the API, or I saw it and assumed it was locked down and missing a bunch of useful information.

Also, if I'm ingesting something from an API it means writing code specific to that API (god forbid I have to get an API token, although in the author's case it doesn't sound like it), whereas with HTML it's often a matter of: go to this selector, figure out which are the landmark headings, which is the body copy, and what is noise. That's easier to generalise if I'm consuming content from many sources, something like the sketch below.
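
A rough sketch of what that generalised extraction looks like, assuming Python with requests and BeautifulSoup (the libraries and selectors here are my own illustration, not anything the author described):

    # Drop the obvious chrome, then pull headings and body copy
    # out of whatever semantic landmark the page offers.
    import requests
    from bs4 import BeautifulSoup

    NOISE = ["nav", "header", "footer", "aside", "script", "style", "form"]

    def extract(url: str) -> dict:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for sel in NOISE:
            for node in soup.select(sel):
                node.decompose()
        # Prefer a semantic landmark; fall back to the whole body.
        root = soup.find("article") or soup.find("main") or soup.body
        return {
            "headings": [h.get_text(strip=True) for h in root.find_all(["h1", "h2", "h3"])],
            "body": "\n\n".join(p.get_text(strip=True) for p in root.find_all("p")),
        }

Point that at a dozen different blogs and it mostly works, which is exactly what writing per-API ingestion code can't give you.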

I can only imagine it's no easier for a crawler; they're probably crawling thousands of sites, and this guy's website is a pit stop. Maybe an LLM can figure out how to generalise it, but surely a crawler limits the role of the AI to reading output and deciding which links to explore next. IDK, maybe it is trivial and costless, but the fact it's not already being done suggests it requires time and resources to set up, and it might be cheaper to keep interpreting the imperfect HTML.