Comment by tigranbs 2 days ago

When I write a scraper, I literally can't write it to account for the API of every single website! BUT I can write code that parses HTML universally, so it is better to find a way to cache your website's HTML so you're not bombarded, rather than write an API and hope companies will spend time implementing it!
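
One way to read the caching idea from the scraper's side is conditional requests: if the origin supports ETags, repeat visits cost it almost nothing. A rough Python sketch (the User-Agent string and the in-memory cache are placeholder assumptions):

    import requests

    _etags: dict[str, str] = {}   # last seen ETag per URL
    _bodies: dict[str, str] = {}  # last seen body per URL

    def fetch(url: str) -> str:
        """Re-download a page only when the server says it changed."""
        headers = {"User-Agent": "example-scraper/1.0"}  # placeholder identity
        if url in _etags:
            headers["If-None-Match"] = _etags[url]

        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:  # 304 Not Modified: reuse the cached copy
            return _bodies[url]

        resp.raise_for_status()
        if "ETag" in resp.headers:
            _etags[url] = resp.headers["ETag"]
        _bodies[url] = resp.text
        return resp.text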

dotancohen 2 days ago

If you are writing a scraper it behooves you to understand the website that you are scraping. WordPress websites, like the one the author is discussing, provide such an API out of the box. And like all WordPress features, this feature is hardly ever disabled or altered by the website administrators.

And identifying a WordPress website just by looking at the HTML is very easy. Anybody experienced in writing web scrapers has done it many times.
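
For example, a rough Python sketch of both points, assuming requests and BeautifulSoup (the fingerprints below are common markers, not an exhaustive list):

    import requests
    from bs4 import BeautifulSoup

    def looks_like_wordpress(html: str) -> bool:
        """Cheap fingerprint check for common WordPress markers in the markup."""
        soup = BeautifulSoup(html, "html.parser")
        generator = soup.find("meta", attrs={"name": "generator"})
        if generator and "wordpress" in (generator.get("content") or "").lower():
            return True
        # Asset paths and the REST API discovery link are other giveaways.
        return "/wp-content/" in html or 'rel="https://api.w.org/"' in html

    def fetch_posts(base_url: str) -> list[dict]:
        """Use the built-in WordPress REST API instead of scraping rendered pages."""
        resp = requests.get(base_url.rstrip("/") + "/wp-json/wp/v2/posts",
                            params={"per_page": 10}, timeout=30)
        resp.raise_for_status()
        return resp.json()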

  • Y-bar 2 days ago

    > If you are writing a scraper it behooves you to understand the website that you are scraping.

    Isn’t that what semantic markup is for? h1 through h6, article, nav, footer (and even microdata) all help both machines and humans understand which parts of the content matter in a given context.

    Why treat certain CMSes differently when we have HTML as a common standard format?
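
    A small sketch of that, assuming BeautifulSoup; it leans only on the standard semantic elements, not on anything CMS-specific:

        from bs4 import BeautifulSoup

        def extract_main_content(html: str) -> str:
            """Prefer <article>/<main> and drop nav/footer/aside chrome."""
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup.find_all(["nav", "footer", "aside", "script", "style"]):
                tag.decompose()  # rarely the content anyone cares about
            container = soup.find("article") or soup.find("main") or soup.body or soup
            return container.get_text(separator="\n", strip=True)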

  • estimator7292 2 days ago

    What if your target isn't any WordPress website, but any website?

    It's simply not possible to carefully craft a scraper for every website on the entire internet.

    Whether or not one should scrape all possible websites is a separate question. But if that is the goal, the only practical way is to consume the HTML directly.

  • pavel_lishin 2 days ago

    If you are designing a car, it behooves you to understand the driveway of your car's purchaser.

    • dotancohen a day ago

      Web scrapers are typically custom written to fit the site they are scraping. Very few motor vehicles are commissioned for a specific purchaser - fewer still to the design of that purchaser.

      • pavel_lishin a day ago

        I have a hard time believing that the scrapers that are feeding data into the big AI companies are custom-written on a per-page basis.

ronsor 2 days ago

WordPress is common enough that it's worth special-casing.

WordPress, MediaWiki, and a few other CMSes are worth implementing special support for just so scraping doesn't take so long!
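
A dispatch sketch of that idea in Python (the detection strings and API paths are typical defaults, not guaranteed for every install):

    import requests
    from bs4 import BeautifulSoup

    def scrape(base_url: str) -> object:
        """Try cheap structured endpoints first; fall back to raw HTML parsing."""
        html = requests.get(base_url, timeout=30).text

        if "/wp-content/" in html:
            # WordPress ships a JSON API at /wp-json/wp/v2/ by default.
            return requests.get(base_url.rstrip("/") + "/wp-json/wp/v2/posts",
                                timeout=30).json()

        if "mediawiki" in html.lower():
            # MediaWiki exposes api.php; the path varies (Wikipedia uses /w/api.php).
            return requests.get(base_url.rstrip("/") + "/w/api.php",
                                params={"action": "query", "list": "allpages",
                                        "format": "json"}, timeout=30).json()

        # Generic fallback: strip the chrome and keep the text.
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(["nav", "footer", "script", "style"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)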

swiftcoder 2 days ago

> BUT I can write code that parses HTML universally

Can you though? Because even big companies rarely manage to do so - as a concrete example, neither Apple nor Mozilla apparently has sufficient resources to produce a reader mode that can reliably find the correct content elements in arbitrary HTML pages.
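
For scale: reader modes mostly rely on scoring heuristics rather than fixed selectors, and even a toy version shows why that is fragile (the scoring rule here is an arbitrary assumption):

    from bs4 import BeautifulSoup

    def densest_block(html: str):
        """Toy readability heuristic: return the element whose direct <p>
        children carry the most text. Real reader modes layer many more
        signals on top, and they still break on plenty of pages."""
        soup = BeautifulSoup(html, "html.parser")
        best, best_score = None, 0
        for candidate in soup.find_all(["article", "main", "section", "div"]):
            score = sum(len(p.get_text(strip=True))
                        for p in candidate.find_all("p", recursive=False))
            if score > best_score:
                best, best_score = candidate, score
        return best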

jarofgreen 2 days ago

> so it is better to find a way to cache your website's HTML so you're not bombarded

Of course, scrapers should identify themselves and then respect robots.txt.
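
The standard library already covers the robots.txt part; a minimal sketch (the User-Agent string is made up):

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    import requests

    USER_AGENT = "ExampleScraper/1.0 (+https://example.com/bot)"  # identify yourself

    def polite_get(url: str) -> str | None:
        """Fetch a URL only if the site's robots.txt allows our User-Agent."""
        parts = urlparse(url)
        rp = RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()

        if not rp.can_fetch(USER_AGENT, url):
            return None  # the site asked not to be crawled here

        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        resp.raise_for_status()
        return resp.text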

contravariant 2 days ago

Why is figuring out what UI elements to capture so much harder than just looking at the network activity to figure out which API calls you need?

DocTomoe 2 days ago

Oh, it is my responsibility to work around YOUR preferred way of doing things, when I have zero benefit from it?

Maybe I just get your scraper's IP range and start poisoning it with junk instead?