Comment by ghxst 2 days ago

2 replies

In Python specifically I like lxml (pretty sure that's what BS uses under the hood?); if you're using Node, parse5 is usually my go-to. Ideally, though, you shouldn't have to parse anything (or not much at all) when doing browser automation, since you have access to the DOM, which gives you an interface that accepts query selectors directly (you don't even need the Runtime domain for most of your needs).
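A rough sketch of what "query the DOM domain directly" can look like over the Chrome DevTools Protocol. The `send(method, params)` callable is a hypothetical stand-in for whatever transport you already have (it is assumed to send one command and return its result dict); only the `DOM.getDocument`, `DOM.querySelector`, and `DOM.getOuterHTML` method names come from the protocol itself.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: sends one CDP command over an already-open session
# and returns that command's "result" dict. Not part of any real library.
Send = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def outer_html_for(send: Send, selector: str) -> Optional[str]:
    """Return the outer HTML of the first node matching a CSS selector."""
    # Fetch only the document root; depth 0 avoids pulling the whole tree.
    root = send("DOM.getDocument", {"depth": 0})["root"]

    # Run the selector in the browser, starting from the root node.
    match = send(
        "DOM.querySelector",
        {"nodeId": root["nodeId"], "selector": selector},
    )
    node_id = match.get("nodeId", 0)

    # Assumption: the browser reports nodeId 0 when nothing matched.
    if node_id == 0:
        return None

    # Serialize just the matched node; no Runtime.evaluate involved.
    return send("DOM.getOuterHTML", {"nodeId": node_id})["outerHTML"]


if __name__ == "__main__":
    # Stand-in transport so the sketch runs without a browser.
    def fake_send(method: str, params: Dict[str, Any]) -> Dict[str, Any]:
        canned = {
            "DOM.getDocument": {"root": {"nodeId": 1}},
            "DOM.querySelector": {"nodeId": 5},
            "DOM.getOuterHTML": {"outerHTML": "<h1>Hello</h1>"},
        }
        return canned[method]

    print(outer_html_for(fake_send, "h1"))
```

The point of the design is that the selector is evaluated by the browser, so the only thing that crosses the wire is the node you asked for, and no HTML parser is needed on the client side.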

mdaniel 2 days ago

> pretty sure that's what BS uses under the hood?

it's an option[1], and my strong advice is to not use lxml for HTML, since html5lib[2] has the explicitly stated goal of being WHATWG-compliant: https://github.com/html5lib/html5lib-python#html5lib

1: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#insta...

2: https://pypi.org/project/html5lib/
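For anyone who wants to compare the parsers on their own markup, here is a minimal sketch, assuming `beautifulsoup4` is installed and that `html5lib` and `lxml` may or may not be. The `BROKEN_HTML` sample and the `parse_with_each` helper are made up for illustration; the parser names are the ones Beautiful Soup accepts as its second argument.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed sample: unclosed <p> tags, a stray </b>,
# and a table cell with no row around it.
BROKEN_HTML = "<p>first<p>second</b><table><td>cell"


def parse_with_each(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Parse the same markup with each available parser.

    Returns a dict mapping parser name to the serialized tree, skipping
    any parser whose backing library is not installed.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            continue
        results[name] = str(soup)
    return results


if __name__ == "__main__":
    # Print each parser's repaired tree so the differences are visible.
    for name, tree in parse_with_each(BROKEN_HTML).items():
        print(f"{name:12} {tree}")
```

The three parsers can repair the same broken input into different trees, so it is worth running this on a real problem page before settling on one.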

  • ghxst 2 days ago

    That's good to know, will try it out. I haven't had many cases of "broken" HTML in projects where I use lxml, but when they do happen they can definitely be a pain.