dotancohen 2 days ago

If you are writing a scraper it behooves you to understand the website that you are scraping. WordPress websites, like the one the author is discussing, provide such an API out of the box (the WordPress REST API), and, like most WordPress features, it is hardly ever disabled or altered by site administrators.
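For instance, a minimal sketch in Python (assuming the default /wp-json/wp/v2/ routes are still enabled, and using the requests library) might look like:

    # Minimal sketch: pull posts as structured JSON from the WordPress REST API
    # instead of scraping rendered HTML. Assumes the default /wp-json/wp/v2/ routes.
    import requests

    def fetch_wordpress_posts(base_url, per_page=10):
        resp = requests.get(
            f"{base_url.rstrip('/')}/wp-json/wp/v2/posts",
            params={"per_page": per_page},
            timeout=10,
        )
        resp.raise_for_status()
        # Each post comes back with title, link, and rendered content fields.
        return [
            {
                "title": post["title"]["rendered"],
                "link": post["link"],
                "content": post["content"]["rendered"],
            }
            for post in resp.json()
        ]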

And identifying a WordPress website by looking at the HTML is very easy; anybody experienced in writing web scrapers has encountered it many times.
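A rough sketch of the usual heuristics (the generator meta tag and wp-content asset paths; requests and BeautifulSoup assumed):

    # Minimal sketch: spotting a WordPress site from its HTML.
    import requests
    from bs4 import BeautifulSoup

    def looks_like_wordpress(url):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        generator = soup.find("meta", attrs={"name": "generator"})
        if generator and "wordpress" in generator.get("content", "").lower():
            return True
        # Theme and plugin assets are served from wp-content/ by default.
        return "/wp-content/" in html or "/wp-includes/" in html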

Y-bar 2 days ago

> If you are writing a scraper it behooves you to understand the website that you are scraping.

That’s what semantic markup is for, no? h1 through h6, article, nav, footer (and even microdata) all help both machines and humans understand which parts of the content to care about in a given context.

Why treat certain CMSes differently when we have HTML as a common, standard format?
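A rough sketch of what leaning on semantic elements can look like (assuming the page actually uses article/nav/footer and headings; BeautifulSoup assumed):

    # Minimal sketch: extract content by semantic elements, not CMS-specific markup.
    from bs4 import BeautifulSoup

    def extract_semantic_parts(html):
        soup = BeautifulSoup(html, "html.parser")
        # Drop chrome that rarely matters for content extraction.
        for tag in soup.find_all(["nav", "footer", "aside", "script", "style"]):
            tag.decompose()
        article = soup.find("article") or soup.find("main") or soup
        headings = [h.get_text(strip=True) for h in article.find_all(["h1", "h2", "h3"])]
        return {"headings": headings, "text": article.get_text(" ", strip=True)}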

estimator7292 2 days ago

What if your target isn't any WordPress website, but any website?

It's simply not possible to carefully craft a scraper for every website on the entire internet.

Whether or not one should scrape all possible websites is a separate question. But if that is one's goal, the only practical way is to consume the HTML directly, as in the sketch below.
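A minimal sketch of that generic approach, with no per-site rules at all (requests and BeautifulSoup assumed, and the user agent string is just a placeholder):

    # Minimal sketch: generic "just consume the HTML" scraping for arbitrary sites.
    import requests
    from bs4 import BeautifulSoup

    def scrape_any_page(url):
        html = requests.get(url, timeout=10, headers={"User-Agent": "example-bot/0.1"}).text
        soup = BeautifulSoup(html, "html.parser")
        # Strip non-content tags before pulling text.
        for tag in soup.find_all(["script", "style", "noscript"]):
            tag.decompose()
        title = soup.title.get_text(strip=True) if soup.title else ""
        return {"url": url, "title": title, "text": soup.get_text(" ", strip=True)}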

pavel_lishin 2 days ago

If you are designing a car, it behooves you to understand the driveway of your car's purchaser.

  • dotancohen a day ago

    Web scrapers are typically custom-written to fit the site they are scraping. Very few motor vehicles are commissioned for a specific purchaser, and fewer still are built to that purchaser's design.

    • pavel_lishin a day ago

      I have a hard time believing that the scrapers that are feeding data into the big AI companies are custom-written on a per-page basis.