Comment by dataviz1000

Comment by dataviz1000 2 days ago

The issue with diffing HTML is that selectors are often autogenerated and change with any update to a website's code. Websites that combat scraping will often autogenerate different HTML on purpose. The first thing is to screen capture the website for comparison. Second, it is possible to determine all the visible elements on a page. With Playwright, inject event listeners into all elements on a page and start automated clicking. If the agent fills out forms, make sure that all fields are available to populate. There are a lot of heuristics.
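One of the heuristics above, checking that every field the agent needs is actually there to fill, might look roughly like this. It is only a sketch under assumptions: `PageLike` and `LocatorLike` are hypothetical stand-ins that mirror the small subset of Playwright's `Page.locator`, `Locator.count`, `Locator.first`, `Locator.isVisible` and `Locator.isEditable` used here (so the snippet does not import Playwright), and `findUnfillableFields` is a made-up helper name.

```typescript
// Hypothetical structural types mirroring the part of Playwright's
// Page/Locator API used below; a real Playwright Page should fit this shape.
interface LocatorLike {
  count(): Promise<number>;
  first(): LocatorLike;
  isVisible(): Promise<boolean>;
  isEditable(): Promise<boolean>;
}

interface PageLike {
  locator(selector: string): LocatorLike;
}

// Returns the selectors that cannot be filled right now: missing from the
// DOM, hidden, or not editable. An empty array means the form looks usable.
async function findUnfillableFields(
  page: PageLike,
  selectors: string[],
): Promise<string[]> {
  const problems: string[] = [];
  for (const selector of selectors) {
    const matches = page.locator(selector);
    // Check existence first so the later checks never wait on a missing node.
    if ((await matches.count()) === 0) {
      problems.push(selector);
      continue;
    }
    const field = matches.first();
    if (!(await field.isVisible()) || !(await field.isEditable())) {
      problems.push(selector);
    }
  }
  return problems;
}
```

With a real page this could be called as `await findUnfillableFields(page, ["#email", "#password"])` before the agent starts typing. Those two selectors are placeholders, and on sites with autogenerated markup something less brittle than CSS ids would be needed.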

thestepafter 2 days ago

Are you doing screenshot comparison with Playwright? If so, how? Based on my research this looks to be a missing feature but I could be incorrect.

  • sahmeepee 2 days ago

    Playwright has screenshot comparison built in, including screenshotting a single element, blanking specific elements, and comparing the textual aspects of elements without a visual comparison. You can even apply a specific stylesheet for comparisons.
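As a rough illustration of the features listed above, here is a sketch of how the options for Playwright's `toHaveScreenshot` assertion could be put together. The helper `screenshotOptions`, the `HasLocator` type, the selectors, the file names and the 1% tolerance are all assumptions made up for the example; only `mask`, `stylePath` and `maxDiffPixelRatio` are actual option names. The assertion calls themselves are shown as comments because they need a running Playwright test.

```typescript
// Hypothetical minimal shape: anything with a Playwright-style `locator`
// method. A real Playwright Page fits, with L inferred as Locator.
interface HasLocator<L> {
  locator(selector: string): L;
}

// Build options for `toHaveScreenshot`:
// - mask: elements painted over before comparing (ads, timestamps, ...)
// - stylePath: stylesheet applied only while the screenshot is taken
// - maxDiffPixelRatio: tolerated fraction of differing pixels (assumed 1%)
function screenshotOptions<L>(
  page: HasLocator<L>,
  volatileSelectors: string[],
  stylePath: string,
) {
  return {
    mask: volatileSelectors.map((selector) => page.locator(selector)),
    stylePath,
    maxDiffPixelRatio: 0.01,
  };
}

// Inside a @playwright/test test body (not executed in this sketch):
//   // whole page, with volatile regions blanked out
//   await expect(page).toHaveScreenshot(
//     "landing.png",
//     screenshotOptions(page, [".ad-banner", "time"], "screenshot.css"),
//   );
//   // a single element
//   await expect(page.locator("nav")).toHaveScreenshot("nav.png");
//   // text only, no visual comparison
//   await expect(page.locator("h1")).toHaveText("Expected heading");
```

The first run records the baseline images and later runs compare against them, so the baseline has to be captured while the site is known to be in a good state.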

    Everything I can see in this demo can be done with Playwright on its own or with some very basic infrastructure e.g. from Azure to run the tests (automations). I can't see what it is adding. Is it doing some bot-detection countermeasures?

    Checking if the page behaviour has changed is pretty easy in Playwright because its primary purpose is testing, so just write some tests to assert the behaviour you expect before you use it.

    We use Playwright to both automate and scrape the site of a public organisation we are obliged to use, as another public body. They do have some bot detection because we get an email when we run the scripts, asking us to confirm our account hasn't been compromised, but so far we have not been blocked. If they ever do block us we will need to hire someone to do manual data entry, but the automation has already paid for itself many times over in a couple of years.

    • dataviz1000 2 days ago

      Some ideas. First, are you saving the cookies and adding them back when Playwright bootstraps? [0] Second, are you using the same IP address every time? Better still, run it from a server in your office or someone's house. Those are the big ones. The first saves you from having to log in on every run.

      It is a game of cat and mouse. It is impossible to stop someone determined to circumvent bot protections.

      [0] https://playwright.dev/docs/api/class-browsercontext#browser...
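The cookie suggestion above could be sketched like this, assuming the session is kept with Playwright's storage state (`browser.newContext({ storageState })` to restore, `context.storageState({ path })` to save). `BrowserLike` and `ContextLike` are hypothetical stand-ins for the corresponding Playwright types, and `openContext`, `saveSession` and the file name in the usage note are invented for the example.

```typescript
// Hypothetical structural types mirroring Playwright's Browser.newContext
// and BrowserContext.storageState, so the sketch does not import Playwright.
interface ContextLike {
  storageState(options?: { path?: string }): Promise<unknown>;
}

interface BrowserLike<C extends ContextLike> {
  newContext(options?: { storageState?: string }): Promise<C>;
}

// Open a context, restoring cookies and localStorage from `statePath`
// when a previous run saved them there.
async function openContext<C extends ContextLike>(
  browser: BrowserLike<C>,
  statePath: string,
  hasSavedState: boolean,
): Promise<C> {
  return browser.newContext(hasSavedState ? { storageState: statePath } : {});
}

// Persist the current cookies and localStorage so the next run can skip login.
async function saveSession(
  context: ContextLike,
  statePath: string,
): Promise<void> {
  await context.storageState({ path: statePath });
}
```

A run would then look something like `const context = await openContext(browser, "state.json", fs.existsSync("state.json"))`, followed by a login only if the site still asks for one, and `await saveSession(context, "state.json")` before closing.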