Comment by xnyan

Comment by xnyan a day ago

1 reply

I think there's some misunderstanding? Sometimes parsing HTML is the best way to get what you need, however there are many situations where one must use something like playwright to get the HTML in the first place (for example, the html is generated clientside by javascript). What's the better alternative?

mdaniel 12 hours ago

Yes, there is for sure some misunderstanding. Of course parsing HTML is the best way to get what you need in a thread about screen scraping using browser automation. And if the target site is the modern bloatware of <html><body><script src=/17gigabytes.js></script></body></html> then for sure one needs a browser (or equivalent) to solve that problem

What I'm saying is that doing the equivalent of

  chrome.exe --dump-html https://example.com/lol \
    | python -c "import bs4; print('reevaluate life choices that led you here')"
is just facepalm stupid. The first step by definition has already parsed all the html (and associated resources) into a very well formed data structure and then makes available THREE selector languages (DOM, CSS, XPath) to reach into that data structure and pull out the things which interest you. BS4 and its silly python friends implement only a small fraction of those selector languages, poorly. So it's fine if a hammer is all you have, but to launch Chrome and then revert to bs4 is just "what problem are you solving here, friend?"