Comment by dotancohen
Comment by dotancohen 2 days ago
If you are writing a scraper it behooves you to understand the website that you are scraping. WordPress websites, like that the author is discussing, provide such an API out of the box. And like all WordPress features, this feature is hardly ever disabled or altered by the website administrators.
And identifying a WordPress website is very easy by looking at the HTML. Anybody experienced in writing web scrapers has encountered it many times.
> If you are writing a scraper it behooves you to understand the website that you are scraping.
That’s what semantic markup is for? No? H1…n:s, article:s, nav:s, footer:s (and microdata even) and all that helps both machines and humans to understand what parts of the content to care about in certain contexts.
Why treat certain CMS:s different when we have the common standard format HTML?