Comment by mpalmer 6 months ago

This looks very cool and makes a lot of sense, except for the idea that it should take the place of Playwright et al.

Personally I'd love to use this as an intermediate workflow for producing deterministic playwright code, but it looks like this is intended for running directly.

I don't think I could plausibly argue for using LLMs at runtime in our test suite at work...

Klaster_1 6 months ago

It's funny you mentioned "deterministic Playwright code," because in my experience, that’s one of the most frustrating challenges of writing integration tests with browser automation tools. Authoring tests is relatively easy, but creating reliable, deterministic tests is much harder.

Most of my test failures come down to timing issues—CPU load subtly affects execution, leading to random timeouts. This makes it difficult to run tests both quickly and consistently. While proactive load-testing of the test environment and introducing artificial random delays during test authoring can help, these steps often end up taking more time than writing the tests themselves.

It would be amazing if tools were smart enough to detect these false positives automatically. After all, if a human can spot them, shouldn’t AI be able to as well?

ffsm8 6 months ago

I was working on a side project over the holidays based on (I think) the same idea mpalmer describes, though my project wouldn't interest him either, because my goal wasn't automating tests.
Basically, the goal would be to handle it like screenshot regression tests, with two separate execution phases: generate and verify.
When verify fails in CI, you can automatically run a generate and open an MR/PR with the new script.
This lets you audit the script and sanity-check it. You're notified of changes, but keeping the tests running takes minimal effort.
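
A minimal sketch of that two-phase flow, under stated assumptions: `run_ci`, `CiResult`, and the `verify`/`generate` callables are hypothetical names invented for illustration, and the toy "script" is just a string. The real generate phase would presumably call an LLM, and the proposed script would go into an MR/PR rather than a return value.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CiResult:
    passed: bool                     # True when the stored script still works
    proposed_script: Optional[str]   # regenerated script awaiting review, if any


def run_ci(
    stored_script: str,
    verify: Callable[[str], bool],
    generate: Callable[[], str],
) -> CiResult:
    """Verify the stored script; on failure, regenerate it for human review."""
    if verify(stored_script):
        # Normal case: the deterministic script ran cleanly, nothing to do.
        return CiResult(passed=True, proposed_script=None)
    # Verify failed: run the (hypothetical) generate phase. The new script is
    # only proposed, e.g. as an MR/PR, and never applied silently.
    return CiResult(passed=False, proposed_script=generate())


# Toy stand-ins: the page renamed its button, so the old script no longer passes.
current_button = "#submit-v2"
result = run_ci(
    stored_script="click #submit",
    verify=lambda script: current_button in script.split(),
    generate=lambda: f"click {current_button}",
)
```

Here `result.passed` is false and `result.proposed_script` holds the regenerated script, which is the point where a human would audit the change.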

- hackgician 6 months ago
  
  This is super interesting. Is it open source? Would love to talk to you more about how this worked.
  
  ffsm8 6 months ago
  
  It's not at a stage where I'd be comfortable putting it on GitHub yet; maybe in a few months.
  And I think you misunderstood my comment: I didn't describe my project, but extrapolated from the parent comment's wish and my own motivations for my project.
  Mine is actually pretty close to Stagehand; at least, I could very well use it. It's basically a web UI to configure browser tasks like open webpage x, iterate over "item type", with LLM integration to determine what the CSS selector for that would be. On the next execution it would attempt to use the previously determined CSS selector instead of the LLM integration. On failures, it'd raise a notification with an admin task to verify new selectors or fix the script.
  But it's a lot of code to put together as a generic UI, as I want these tasks to be repeatable without restarting from the beginning, etc.
  Still very much in the PoC stage, without any tests, with barely working persistence, etc.
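
The cache-then-fallback behavior described above could be sketched roughly like this. Everything here is an assumption made for illustration: `SelectorCache`, `resolve`, `ask_llm`, and `matches` are hypothetical names, the "page" is a plain set of strings, and re-asking the LLM when a cached selector stops matching is one possible reading of "on failures", not the project's confirmed behavior.

```python
from typing import Callable, Dict, List, Optional


class SelectorCache:
    """Remembers the last selector that worked for each named item type."""

    def __init__(self, ask_llm: Callable[[str], Optional[str]]) -> None:
        self._ask_llm = ask_llm              # hypothetical LLM-backed lookup
        self._selectors: Dict[str, str] = {}
        self.notifications: List[str] = []   # admin tasks raised for review

    def resolve(self, item_type: str, matches: Callable[[str], bool]) -> Optional[str]:
        cached = self._selectors.get(item_type)
        if cached is not None and matches(cached):
            return cached  # fast path: repeat runs make no LLM call
        candidate = self._ask_llm(item_type)
        if candidate is not None and matches(candidate):
            if cached is not None:
                # The old selector broke; ask an admin to verify the new one.
                self.notifications.append(
                    f"verify new selector for {item_type!r}: {candidate}"
                )
            self._selectors[item_type] = candidate
            return candidate
        self.notifications.append(f"fix script: no working selector for {item_type!r}")
        return None


page_selectors = {"li.product"}   # selectors that match on the toy "page"
llm_calls: List[str] = []


def fake_llm(item_type: str) -> Optional[str]:
    # Stand-in for the LLM: records the call and returns a matching selector.
    llm_calls.append(item_type)
    return next(iter(page_selectors), None)


def on_page(selector: str) -> bool:
    return selector in page_selectors


cache = SelectorCache(ask_llm=fake_llm)
first = cache.resolve("product", on_page)    # LLM consulted once
second = cache.resolve("product", on_page)   # served from the cache
```

The design choice worth noting is that the LLM sits only on the slow path, so a stable page costs nothing extra per run and every selector change leaves a reviewable notification.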
  
Kostarrr 6 months ago

Hi! Kosta from Octomind here.

We built basically this: Let an LLM agent take a look at your web page and generate the playwright code to test it. Running the test is just running the deterministic playwright code.

Of course, the actual hard work is _maintaining_ end-to-end tests, so our agent can do that for you as well.

Feel free to check us out, we have a no-hassle free tier.

hackgician 6 months ago

Octomind is sick, web agents are such an interesting space; would love to talk to you more about challenges you might've faced in building it

- Kostarrr 6 months ago
  
  Sorry, didn't see this earlier. If you're interested, reach out to me (Kosta Welke) on LinkedIn, or write me an email; you can find me on Octomind's About page.
  
ramesh31 6 months ago

>Personally I'd love to use this as an intermediate workflow for producing deterministic playwright code, but it looks like this is intended for running directly.

Treating UI test code as some kind of static source of truth is the biggest nightmare in all of UI front end development. Web UIs naturally have a ton of "jank" that accumulates over time, which leads to a ton of false negatives; slow API calls, random usages of websockets/SSE, invisible elements, non-idempotent endpoints, etc. etc. And having to write "deterministic" test code for those is the single biggest reason why no one ever actually does it.

I don't care that the page I'm testing has a different DOM structure now, or uses a different button component with a different test ID. All I care about is "can the user still complete X workflow after my changes have been made". If the LLM wants to completely rewrite the underlying test code, I couldn't care less so long as it still achieves that result and is assuring me that my application works as intended E2E.

mpalmer 6 months ago

> Treating UI test code as some kind of static source of truth is the biggest nightmare in all of UI front end development. Web UIs naturally have a ton of "jank" that accumulates over time, which leads to a ton of false negatives; slow API calls, random usages of websockets/SSE, invisible elements, non-idempotent endpoints, etc. etc. And having to write "deterministic" test code for those is the single biggest reason why no one ever actually does it.

It is, in fact, very possible to extract value from testing methods like this, provided you take the proper care and control both the UI and the tests. It's definitely very easy to end up with a flaky suite of tests that's a net drag on productivity, but it's not inevitable.

On the other hand, I have every confidence that an LLM-based test suite would introduce more flakiness and uncertainty than it could rid me of.

- ramesh31 6 months ago
  
  >provided you take the proper care and control both the UI and the tests.
  
  And no one ever does. There is zero incentive to spend days wrangling with a flaky UI test throwing a false positive for your new feature, and so the test gets skipped and everyone moves on and forgets about it. I have literally never seen a project where UI tests were continually added to and maintained after the initial build-out, simply because it is an immense time sink with no visible or perceived value to product, business, or users, and requires tons of manual maintenance to keep in sync with the application.
  
- Klaster_1 6 months ago
  
  What's your secret to "proper care and control both the UI and the tests"? If you meant the jankiness that @ramesh31 and I (in a sibling comment) mentioned, then that's exactly what I expect AI tools to solve, with a productivity boost as the result.
  
hackgician 6 months ago

Interesting, thanks for the feedback! By "taking the place of Playwright," we don't mean the AI itself is going to replace Playwright. Rather, you can continue to use existing Playwright code with new AI functionalities. In addition, we don't really intend for Stagehand to be used in a test suite (though you could!).

Rather, we want Stagehand to assist people who want to build web agents. For example, I was using headless browsers earlier in 2024 to do real-time RAG on e-commerce websites that could aggregate results for vibes-based search queries. These sites might have random DOM changes over time that make it hard to write sustainable DOM selectors, or annoying pop-ups that are hard to deterministically code against.

This is the perfect use for Stagehand! If you're doing QA on your own site, then base Playwright (as you mention) is likely the better solution.

andrewmcwatters 6 months ago

It seems to me like Selenium would have been a more appropriate API to extend from, then. Playwright, whatever people might want it to be, is explicitly positioned for testing first.
People in the browser automation space consistently ignore this, for whatever reason, even though it's right there on their site in black and white.

- hackgician 6 months ago
  
  Appreciate the feedback. Our take is that Playwright is an open-source library with a lot of built-in features that make building with it a lot easier, so it's definitely an easier starting point for us.
  
  andrewmcwatters 6 months ago
  
  That's the same reason everyone else ignores the fact that it's a testing library. Except now you're forcing users to write kludges that wrap around the testing interface.
  
  mrbluecoat 6 months ago
  
  Working with Selenium has always been painful for me, so I for one thank you for providing an alternative solution.
  
cjonas 6 months ago

How do you get by when every major site starts blocking headless browsers? A good example right now is Zillow, but I foresee a world where big chunks of the internet are behind CAPTCHAs and bot detection.

- andrewmcwatters 6 months ago
  
  That's not really a problem for Stagehand. It's a problem for Selenium, Playwright, Puppeteer and others at the browser automation library level.
  
  cjonas 6 months ago
  
  It's not really a problem for Playwright, because Playwright is really intended to be run by the owners of the website, not as a web scraper.
  It may become a real problem for the usefulness of this style of LLM-driven browsing.
  