Comment by tigranbs 2 days ago

When I write a scraper, I literally can't write it to account for the API of every single website! BUT I can write code that parses HTML universally, so it is better to find a way to cache your website's HTML so you're not bombarded, rather than write an API and hope companies will spend time implementing it!
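
One way to read the caching idea from the scraper's side is conditional requests: if the origin supports ETags, repeat visits cost it almost nothing. A rough Python sketch (the User-Agent string and the in-memory cache are placeholder assumptions):

    import requests

    _etags: dict[str, str] = {}   # last seen ETag per URL
    _bodies: dict[str, str] = {}  # last seen body per URL

    def fetch(url: str) -> str:
        """Re-download a page only when the server says it changed."""
        headers = {"User-Agent": "example-scraper/1.0"}  # placeholder identity
        if url in _etags:
            headers["If-None-Match"] = _etags[url]

        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code == 304:  # 304 Not Modified: reuse the cached copy
            return _bodies[url]

        resp.raise_for_status()
        if "ETag" in resp.headers:
            _etags[url] = resp.headers["ETag"]
        _bodies[url] = resp.text
        return resp.text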

dotancohen 2 days ago

If you are writing a scraper it behooves you to understand the website that you are scraping. WordPress websites, like the one the author is discussing, provide such an API out of the box. And like all WordPress features, this feature is hardly ever disabled or altered by the website administrators.

And identifying a WordPress website just by looking at the HTML is very easy. Anybody experienced in writing web scrapers has done it many times.
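
For example, a rough Python sketch of both points, assuming requests and BeautifulSoup (the fingerprints below are common markers, not an exhaustive list):

    import requests
    from bs4 import BeautifulSoup

    def looks_like_wordpress(html: str) -> bool:
        """Cheap fingerprint check for common WordPress markers in the markup."""
        soup = BeautifulSoup(html, "html.parser")
        generator = soup.find("meta", attrs={"name": "generator"})
        if generator and "wordpress" in (generator.get("content") or "").lower():
            return True
        # Asset paths and the REST API discovery link are other giveaways.
        return "/wp-content/" in html or 'rel="https://api.w.org/"' in html

    def fetch_posts(base_url: str) -> list[dict]:
        """Use the built-in WordPress REST API instead of scraping rendered pages."""
        resp = requests.get(base_url.rstrip("/") + "/wp-json/wp/v2/posts",
                            params={"per_page": 10}, timeout=30)
        resp.raise_for_status()
        return resp.json()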

  • Y-bar 2 days ago

    > If you are writing a scraper it behooves you to understand the website that you are scraping.

    Isn’t that what semantic markup is for? h1 through h6, article, nav, footer (and even microdata) all help both machines and humans understand which parts of the content matter in a given context.

    Why treat certain CMSes differently when we have HTML as a common standard format?
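
    A small sketch of that, assuming BeautifulSoup; it leans only on the standard semantic elements, not on anything CMS-specific:

        from bs4 import BeautifulSoup

        def extract_main_content(html: str) -> str:
            """Prefer <article>/<main> and drop nav/footer/aside chrome."""
            soup = BeautifulSoup(html, "html.parser")
            for tag in soup.find_all(["nav", "footer", "aside", "script", "style"]):
                tag.decompose()  # rarely the content anyone cares about
            container = soup.find("article") or soup.find("main") or soup.body or soup
            return container.get_text(separator="\n", strip=True)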

  • estimator7292 2 days ago

    What if your target isn't any WordPress website, but any website?

    It's simply not possible to carefully craft a scraper for every website on the entire internet.

    Whether or not one should scrape all possible websites is a separate question. But if that is the goal, the only practical way is to consume the HTML directly.

  • pavel_lishin 2 days ago

    If you are designing a car, it behooves you to understand the driveway of your car's purchaser.

    • dotancohen a day ago

      Web scrapers are typically custom written to fit the site they are scraping. Very few motor vehicles are commissioned for a specific purchaser - fewer still to the design of that purchaser.

      • pavel_lishin a day ago

        I have a hard time believing that the scrapers that are feeding data into the big AI companies are custom-written on a per-page basis.

ronsor 2 days ago

WordPress is common enough that it's worth special-casing.

WordPress, MediaWiki, and a few other CMSes are worth implementing special support for just so scraping doesn't take so long!
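
A dispatch sketch of that idea in Python (the detection strings and API paths are typical defaults, not guaranteed for every install):

    import requests
    from bs4 import BeautifulSoup

    def scrape(base_url: str) -> object:
        """Try cheap structured endpoints first; fall back to raw HTML parsing."""
        html = requests.get(base_url, timeout=30).text

        if "/wp-content/" in html:
            # WordPress ships a JSON API at /wp-json/wp/v2/ by default.
            return requests.get(base_url.rstrip("/") + "/wp-json/wp/v2/posts",
                                timeout=30).json()

        if "mediawiki" in html.lower():
            # MediaWiki exposes api.php; the path varies (Wikipedia uses /w/api.php).
            return requests.get(base_url.rstrip("/") + "/w/api.php",
                                params={"action": "query", "list": "allpages",
                                        "format": "json"}, timeout=30).json()

        # Generic fallback: strip the chrome and keep the text.
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(["nav", "footer", "script", "style"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)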

swiftcoder 2 days ago

> BUT I can write code that parses HTML universally

Can you though? Because even big companies rarely manage to do so - as a concrete example, neither Apple nor Mozilla apparently has sufficient resources to produce a reader mode that can reliably find the correct content elements in arbitrary HTML pages.
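
For scale: reader modes mostly rely on scoring heuristics rather than fixed selectors, and even a toy version shows why that is fragile (the scoring rule here is an arbitrary assumption):

    from bs4 import BeautifulSoup

    def densest_block(html: str):
        """Toy readability heuristic: return the element whose direct <p>
        children carry the most text. Real reader modes layer many more
        signals on top, and they still break on plenty of pages."""
        soup = BeautifulSoup(html, "html.parser")
        best, best_score = None, 0
        for candidate in soup.find_all(["article", "main", "section", "div"]):
            score = sum(len(p.get_text(strip=True))
                        for p in candidate.find_all("p", recursive=False))
            if score > best_score:
                best, best_score = candidate, score
        return best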

jarofgreen 2 days ago

> so it is better to find a way to cache your website's HTML so you're not bombarded

Of course, scrapers should identify themselves and then respect robots.txt.
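
The standard library already covers the robots.txt part; a minimal sketch (the User-Agent string is made up):

    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    import requests

    USER_AGENT = "ExampleScraper/1.0 (+https://example.com/bot)"  # identify yourself

    def polite_get(url: str) -> str | None:
        """Fetch a URL only if the site's robots.txt allows our User-Agent."""
        parts = urlparse(url)
        rp = RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()

        if not rp.can_fetch(USER_AGENT, url):
            return None  # the site asked not to be crawled here

        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        resp.raise_for_status()
        return resp.text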

contravariant 2 days ago

Why is figuring out what UI elements to capture so much harder than just looking at the network activity to figure out which API calls you need?

DocTomoe 2 days ago

Oh, it is my responsibility to work around YOUR preferred way of doing things, when I have zero benefit from it?

Maybe I just get your scraper's IP range and start poisoning it with junk instead?