Comment by mdaniel 2 days ago

> Finic uses Playwright to interact with DOM elements, and recommends BeautifulSoup for HTML parsing.

I have never, ever understood anyone who goes to the trouble of booting up a browser, and then uses a python library to do static HTML parsing

Anyway, I was surfing around the repo trying to find what, exactly "Safely store and access credentials using Finic’s built-in secret manager" means

ayanb9440 2 days ago

We're in the middle of putting this together right now, but it's going to be a wrapper around Google Secret Manager for those who don't want to set up a secrets manager themselves.

0x3444ac53 2 days ago

Oftentimes websites won't load the HTML without executing the JavaScript, or they use JavaScript running client-side to generate the entire page.
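A minimal stdlib-only illustration of that point (the HTML string here is invented for the example, standing in for the static shell many single-page apps serve): a plain HTTP fetch of such a page yields no scrapeable text at all, because the content only exists after the browser runs the script.

```python
from html.parser import HTMLParser

# Invented example of an SPA "shell" page: all content is rendered
# client-side by /app.js, so the static HTML carries no text.
SPA_SHELL = """<html><body><div id="root"></div>
<script src="/app.js"></script></body></html>"""

class TextCollector(HTMLParser):
    """Collects every non-whitespace text node from the document."""
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

collector = TextCollector()
collector.feed(SPA_SHELL)
print(collector.text)  # → [] : nothing to scrape without executing app.js
```

This is the case where a real browser (or equivalent JS runtime) is genuinely required just to obtain the HTML.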

  • mdaniel 2 days ago

    I feel that we are in agreement for the cases where one would use Playwright, and for damn sure would not involve BS4 for anything in that case

msp26 2 days ago

What would you recommend for parsing instead?

  • mdaniel 2 days ago

    In this specific scenario, where the project is using *automated Chrome* to even bother with the connection, redirects, and bazillions of other "browser-y" things to arrive at HTML to be parsed, the very idea that one would `soup = BeautifulSoup(playwright.content())` is crazypants to me

    I am open to the fact that html5lib strives to parse correctly, and good for them, but that would be the case where one wished to use python for parsing to avoid the pitfalls of dragging a native binary around with you

    • xnyan a day ago

      I think there's some misunderstanding? Sometimes parsing HTML is the best way to get what you need, however there are many situations where one must use something like playwright to get the HTML in the first place (for example, the html is generated clientside by javascript). What's the better alternative?

      • mdaniel 12 hours ago

        Yes, there is for sure some misunderstanding. Of course parsing HTML is the best way to get what you need in a thread about screen scraping using browser automation. And if the target site is the modern bloatware of <html><body><script src=/17gigabytes.js></script></body></html> then for sure one needs a browser (or equivalent) to solve that problem

        What I'm saying is that doing the equivalent of

          chrome.exe --headless --dump-dom https://example.com/lol \
            | python -c "import bs4; print('reevaluate life choices that led you here')"
        
        is just facepalm stupid. The first step by definition has already parsed all the html (and associated resources) into a very well formed data structure and then makes available THREE selector languages (DOM, CSS, XPath) to reach into that data structure and pull out the things which interest you. BS4 and its silly python friends implement only a small fraction of those selector languages, poorly. So it's fine if a hammer is all you have, but to launch Chrome and then revert to bs4 is just "what problem are you solving here, friend?"
  • ghxst 2 days ago

    In python specifically I like lxml (pretty sure that's what BS uses under the hood?); parse5 is usually my go-to if you're using node. Ideally, though, you shouldn't really have to parse anything (or not much at all) when doing browser automation, since you have access to the DOM, which gives you an interface that accepts query selectors directly (you don't even need the Runtime domain for most of your needs).