Show HN: Stagehand – an open source browser automation framework powered by AI

326 points by hackgician 10 months ago

Hi HN! I’m Anirudh — longtime lurker, first time poster, and I couldn’t be more excited to show you Stagehand.

Stagehand is a TypeScript project that extends Playwright with three simple AI methods — act, extract, and observe. We’d love for you to try it out using the command below:

    npx create-browser-app --example quickstart

Here’s a sample workflow:

    import { Stagehand } from "@browserbasehq/stagehand";
    import { z } from "zod";

    const stagehand = new Stagehand();
    await stagehand.init();

    // Stagehand overrides the Playwright Page and Context classes
    const { page, context } = stagehand

    await page.goto("https://instadash.com") // Regular Playwright

    // Take action on the page
    await page.act({ action: "click on taqueria cazadores" })

    // Extract relevant data from the page
    const { price } = await page.extract({
        instruction: "extract the price of the super burrito",
        schema: z.object({
            price: z.number()
        })
    })

We built Stagehand because we loved building browser automations using Playwright and Selenium, but we grew frustrated at how cumbersome it is to just get started and write simple browser automations. These frameworks, while incredibly powerful, are built for QA testing and are thus notoriously prone to fail if there are minor changes in the UI or underlying DOM structure.

The goal of Stagehand is twofold:

1. Make browser automations easier to write
2. Make browser automations more resilient to DOM changes

We were super energized by what we’ve been seeing with vision-based computer use agents. We think with a browser, you can provide even richer data by leveraging the information in the DOM + a11y tree in addition to what’s rendered on the page. However, we didn’t want to go so far as to build an agent, since we wanted fine-grained control over each step that an agent can take.

Therefore, the happy medium we built was to extend the existing powerful functionalities of Playwright with simple and extensible AI APIs that return the decision-making power back to the developer at each step.

Check out our docs: https://docs.stagehand.dev

We’d love for you to join and give us feedback on Slack as well: https://stagehand.dev/slack

dchuk 10 months ago

This looks awesome.

What I would love to see either as something leveraging this, or built in to this, is if you prompt stagehand to extract data from a page, it also returns the xpath elements you'd use to re-scrape the page without having to use an LLM to do that second scraping.

So basically, you can scrape new pages never before seen with the non-deterministic LLM tool, and then when you need to rescrape the page again to update content for example, you can use the cheaper old-school scraping method.

Not sure how brittle this would be, both in going from the LLM version to the xpath version reliably and in how to fall back to the LLM version if your xpath script fails, but overall, conceptually, being able to scrape using the smart tools while building up a library of dumb scraping scripts over time would be killer.
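
A rough sketch of that idea, for illustration only: learn a selector with the LLM once, reuse it on later scrapes, and fall back to the LLM when the stored selector stops matching. `SelectorLibrary`, `evalXPath`, and `llmExtract` are hypothetical names, not Stagehand or Playwright APIs, and everything is kept synchronous for brevity (real browser and LLM calls would be async).

```typescript
// Hypothetical sketch: these names are illustrative, not Stagehand or Playwright APIs.
interface ScrapeDeps {
  // Cheap path: evaluate a stored XPath against the page; null if nothing matches.
  evalXPath: (html: string, xpath: string) => string | null;
  // Expensive path: ask an LLM for the value and the XPath where it found it.
  llmExtract: (html: string, instruction: string) => { value: string; xpath: string };
}

class SelectorLibrary {
  // instruction -> last XPath the LLM pass reported for it
  private learned = new Map<string, string>();
  private deps: ScrapeDeps;

  constructor(deps: ScrapeDeps) {
    this.deps = deps;
  }

  scrape(html: string, instruction: string): { value: string; usedLlm: boolean } {
    const xpath = this.learned.get(instruction);
    if (xpath !== undefined) {
      const value = this.deps.evalXPath(html, xpath);
      if (value !== null) return { value, usedLlm: false };
      // The stored selector no longer matches, so fall through to the LLM.
    }
    const result = this.deps.llmExtract(html, instruction);
    this.learned.set(instruction, result.xpath); // remember (or repair) the selector
    return { value: result.value, usedLlm: true };
  }
}
```

The brittleness question remains: a stale selector that still matches something would return a wrong value without triggering the fallback, so some validation of the cheap path's output (for example against a schema) would likely be needed.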


chaosharmonic 10 months ago

I've been on a similar thread w my own crawler project -- conceptually at least, since I'm intentionally building as much of it by hand as possible... Anyway, after a lot of browser automation, I've realized that it's more flexible and easier to maintain to just use a DOM polyfill server-side and then use the client to get raw HTML responses wherever possible. (And, in conversations about similar LLM-focused tools, that if you generate parsing functions you can reuse, you don't necessarily need an LLM to process your results.)
I'm still trying to figure out the boundaries of where and how I want to scale that out into other stuff -- things like when to use `page` methods directly, vs passing a function into `page.evaluate`, vs other alternatives like a browser extension or a CLI tool. And I'm still needing to work around smaller issues with the polyfill and its spec coverage (leaving me to use things like `getAttribute` more than I would otherwise). But in the meantime it's simplified a lot of ancillary issues, like handling failures on my existing workflows and scaling out to new targets, while I work on other bot detection issues.

matsemann 10 months ago

Agree. The worst part of integration tests is how brittle they often are. I don't want to introduce yet another thing that could give false test errors.
But of course, the way it works now could also help reduce the brittleness. With an xpath or selector, it quickly breaks when the design changes or things are moved around. With this, it might overcome this.
So tradeoffs, I guess.

hackgician 10 months ago

Yeah, I think someone opened a similar issue on GitHub: https://github.com/browserbase/stagehand/issues/389
Repeatability of extract() is definitely super interesting and something we're looking into

- 9dev 10 months ago
  
  Cache the response for a given query-page hash pair maybe? So the LLM will only be consulted when the page content hash changes, and the previous answer is reused otherwise
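
A minimal sketch of such a cache, assuming Node's built-in `crypto` module for hashing. `PageQueryCache` and `askLlm` are hypothetical names for illustration, not part of Stagehand, and the callback is synchronous only to keep the example short.

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache for LLM answers, keyed by (query, hash of page content).
class PageQueryCache<T> {
  private store = new Map<string, T>();

  private key(query: string, pageContent: string): string {
    const pageHash = createHash("sha256").update(pageContent).digest("hex");
    return JSON.stringify([query, pageHash]);
  }

  // Returns the cached answer, or calls `askLlm` once and remembers the result.
  getOrCompute(query: string, pageContent: string, askLlm: () => T): T {
    const k = this.key(query, pageContent);
    if (this.store.has(k)) return this.store.get(k) as T;
    const answer = askLlm();
    this.store.set(k, answer);
    return answer;
  }
}
```

One caveat with hashing the raw page: any byte-level change (a timestamp, a rotating ad) produces a new hash, so in practice the content would probably need to be normalized before hashing.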
  
ushakov 10 months ago

there’s also llm-scraper: https://github.com/mishushakov/llm-scraper
disclaimer: i am the author


mpalmer 10 months ago

This looks very cool and makes a lot of sense, except for the idea that it should take the place of Playwright et al.

Personally I'd love to use this as an intermediate workflow for producing deterministic playwright code, but it looks like this is intended for running directly.

I don't think I could plausibly argue for using LLMs at runtime in our test suite at work...


Klaster_1 10 months ago

It's funny you mentioned "deterministic Playwright code," because in my experience, that’s one of the most frustrating challenges of writing integration tests with browser automation tools. Authoring tests is relatively easy, but creating reliable, deterministic tests is much harder.
Most of my test failures come down to timing issues—CPU load subtly affects execution, leading to random timeouts. This makes it difficult to run tests both quickly and consistently. While proactive load-testing of the test environment and introducing artificial random delays during test authoring can help, these steps often end up taking more time than writing the tests themselves.
It would be amazing if tools were smart enough to detect these false positives automatically. After all, if a human can spot them, shouldn’t AI be able to as well?

- ffsm8 10 months ago
  
  I was working on a side project over the holidays with (I think) the same idea that mpalmer imagined there (though my project wouldn't be of interest to him, because my goal wasn't automating tests).
  Basically, the goal would be to do it like screenshot regression tests, with 2 different execution phases:
  - generate
  - verify
  And when verify fails in CI, you can automatically run a generate and open a MR/PR with the new script.
  This lets you audit the script and do a plausibility check; you'll be notified of changes but need minimal effort to keep the tests running
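
The generate/verify flow described above could be sketched roughly as follows. All names here (`Phases`, `runCi`, `openPullRequest`) are hypothetical placeholders supplied by the caller, not functions from any real tool, and the callbacks are synchronous only for brevity.

```typescript
// Hypothetical two-phase runner; every callback is a placeholder, not a real API.
interface Phases {
  generate: () => string; // e.g. an LLM produces a fresh automation script
  verify: (script: string) => boolean; // run the committed script deterministically
  openPullRequest: (newScript: string) => void; // hand the change to a human reviewer
}

type CiOutcome = "passed" | "regenerated";

function runCi(committedScript: string, phases: Phases): CiOutcome {
  // Normal case: the committed script still works, so no LLM is involved.
  if (phases.verify(committedScript)) return "passed";
  // Verify failed: regenerate and propose the new script for review
  // instead of silently replacing the old one.
  const candidate = phases.generate();
  phases.openPullRequest(candidate);
  return "regenerated";
}
```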
  
  
  hackgician 10 months ago
  
  This is super interesting, is it open source? Would love to talk to you more about how this worked
  
  
  ffsm8 10 months ago
  
  It's not at a stage I'd be comfortable putting on GitHub yet, maybe in a few months.
  And I think you misunderstood my comment: I didn't describe my project, but extrapolated from the parent's desire and my motivations for my project.
  Mine is actually pretty close to Stagehand, at least I could very well use it. It's basically a web UI to configure browser tasks like open webpage x, iterate over "item type", with LLM integration to determine what the CSS selector for that would be. On the next execution it would attempt to use the previously determined CSS selector instead of the LLM integration. On failures, it'd raise a notification with an admin task to verify new selectors/fix the script
  But it's a lot of code to put together as a generic UI - as I want these tasks to be repeatable without restarting from the beginning etc
  Still very much in the PoC stage without any tests, barely working persistence etc
  
Kostarrr 10 months ago

Hi! Kosta from Octomind here.
We built basically this: Let an LLM agent take a look at your web page and generate the playwright code to test it. Running the test is just running the deterministic playwright code.
Of course, the actual hard work is _maintaining_ end-to-end tests so our agent can do that for you as well.
Feel free to check us out, we have a no-hassle free tier.

- hackgician 10 months ago
  
  Octomind is sick, web agents are such an interesting space; would love to talk to you more about challenges you might've faced in building it
  
  
  Kostarrr 10 months ago
  
  Sorry, didn't see this earlier. If you're interested, reach out to me (Kosta Welke) on LinkedIn. Or write me an email; you can find me on Octomind's About page.
  
ramesh31 10 months ago

>Personally I'd love to use this as an intermediate workflow for producing deterministic playwright code, but it looks like this is intended for running directly.
Treating UI test code as some kind of static source of truth is the biggest nightmare in all of UI front end development. Web UIs naturally have a ton of "jank" that accumulates over time, which leads to a ton of false negatives; slow API calls, random usages of websockets/SSE, invisible elements, non-idempotent endpoints, etc. etc. And having to write "deterministic" test code for those is the single biggest reason why no one ever actually does it.
I don't care that the page I'm testing has a different DOM structure now, or uses a different button component with a different test ID. All I care about is "can the user still complete X workflow after my changes have been made". If the LLM wants to completely rewrite the underlying test code, I couldn't care less so long as it still achieves that result and is assuring me that my application works as intended E2E.

- mpalmer 10 months ago
  
  > Treating UI test code as some kind of static source of truth is the biggest nightmare in all of UI front end development. Web UIs naturally have a ton of "jank" that accumulates over time, which leads to a ton of false negatives; slow API calls, random usages of websockets/SSE, invisible elements, non-idempotent endpoints, etc. etc. And having to write "deterministic" test code for those is the single biggest reason why no one ever actually does it.
  It is, in fact, very possible to extract value from testing methods like this, provided you take the proper care and control both the UI and the tests. It's definitely very easy to end up with a flaky suite of tests that's a net drag on productivity, but it's not inevitable.
  On the other hand, I have every confidence that an LLM-based test suite would introduce more flakiness and uncertainty than it could rid me of.
  
  
  ramesh31 10 months ago
  
  >provided you take the proper care and control both the UI and the tests.
  And no one ever does. There is zero incentive to spend days wrangling with a flakey UI test throwing a false positive for your new feature, and so the test gets skipped and everyone moves on and forgets about it. I have literally never seen a project where UI tests were continually added to and maintained after the initial build out, simply because it is an immense time sink with no visible or perceived value to product, business, or users, and requires tons of manual maintenance to keep in sync with the application.
  
  
  Klaster_1 10 months ago
  
  What's your secret to "proper care and control both the UI and the tests"? If you meant the jankiness that @ramesh31 and I (in a sibling comment) mentioned, then that's exactly what I expect AI tools to solve, with a productivity boost as the result.
  
hackgician 10 months ago

Interesting, thanks for the feedback! By "taking the place of Playwright," we don't mean the AI itself is going to replace Playwright. Rather, you can continue to use existing Playwright code with new AI functionalities. In addition, we don't really intend for Stagehand to be used in a test suite (though you could!).
Rather, we want Stagehand to assist people who want to build web agents. For example, I was using headless browsers earlier in 2024 to do real-time RAG on e-commerce websites that could aggregate results for vibes-based search queries. These sites might have random DOM changes over time that make it hard to write sustainable DOM selectors, or annoying pop-ups that are hard to deterministically code against.
This is the perfect use for Stagehand! If you're doing QA on your own site, then base Playwright (as you mention) is likely the better solution

- andrewmcwatters 10 months ago
  
  It seems to me like Selenium would have been a more appropriate API to extend from, then. Playwright, despite whatever people want it to be otherwise, is explicitly positioned for testing, first.
  People in the browser automation space consistently ignore this, for whatever reason. Though, it's right on their site in black and white.
  
  
  hackgician 10 months ago
  
  Appreciate the feedback. Our take is that Playwright is an open-sourced library with a lot of built-in features that make building with it a lot easier, so it's definitely an easier starting point for us
  
- cjonas 10 months ago
  
  How do you get by when every major site starts blocking headless browsers? A good example right now is Zillow, but I foresee a world where big chunks of the internet are behind captcha and bot detection
  
  
  andrewmcwatters 10 months ago
  
  That's not really a problem for Stagehand. It's a problem for Selenium, Playwright, Puppeteer and others at the browser automation library level.
  

asar 10 months ago

This looks really cool, thanks for sharing!

I recently tried to implement a workflow automation using similar frameworks that were playwright or puppeteer based. My goal was to log into a bunch of vendor backends and extract values for reporting (no APIs available). What stopped me entirely were websites that implemented an invisible captcha. They can detect a playwright instance by how it interacts with the DOM. Pretty frustrating, but I can totally see this becoming a standard as crawling and scraping is getting out of control.


hackgician 10 months ago

Thanks so much! Yes, a lot of antibots are able to detect Playwright based on browser config. Generally, antibots are a good thing -- I think in the future, as web agents become more popular, I'd imagine a fruitful partnership to prevent misuse if it's coming from a trusted web agent v. an unknown one


z3t4 10 months ago

My kneejerk reflex: "create-browser-app" is a very generic name, should just have called it "stagehand"


sparab18 10 months ago

I've been playing around with Stagehand for a minute now, actually a useful abstraction here. We build scrapers for websites that are pretty adversarial, so having built in proxies and captcha is delightful.

Do you guys ever think you'll do a similar abstraction for MCP and computer use more broadly?


hackgician 10 months ago

Thanks so much! Our Stagehand MCP server actually won Anthropic's Claude MCP hackathon :) Check it out: https://github.com/browserbase/mcp-server-browserbase/tree/m...
We're working on a better computer use integration using Stagehand, def a lot of interesting potential there

- chw9e 10 months ago
  
  The MCP server is useful, I built a demo of converting bug reports into Playwright tests using it - https://qckfx.com/demo
  
  
  hackgician 10 months ago
  
  Whoa -- this is so cool! Is this open source? Would love to check it out
  
  
  chw9e 10 months ago
  
  No, but happy to chat about it. My email is chris.wood@qckfx.com
  
- jimmySixDOF 10 months ago
  
  interesting and hope to see this improve with open source GUI Agent vision model projects like OS-Atlas
  https://osatlas.github.io/
  

xingwu 10 months ago

Can the script be compiled into actual DOM operations so that we don't need an LLM for every run?


tomatohs 10 months ago

Cool! Before building a full test platform for testdriver.ai we made a similar sdk called Goodlooks. It didn't get much traction, but will leave it here for those interested: https://github.com/testdriverai/goodlooks


hackgician 10 months ago

This is sick! Starred, thanks for sharing :)


zanesabbagh 10 months ago

Have been on the Slack for a while and this crew has had an insane product velocity. Excited to see where it goes!


hackgician 10 months ago

Thanks so much Zane!!


pryelluw 10 months ago

Can it be adapted to use Ollama? Seems like a good tool to set up locally as a navigation tool.


hackgician 10 months ago

Yes, you can certainly use Ollama! However, we strongly recommend using a more beefed-up model to get sustainable results. Check out our external_client.ts file in examples/, which shows you how to set up a custom LLMClient: <https://github.com/browserbase/stagehand/blob/main/examples/...>

- fidotron 10 months ago
  
  It doesn’t look like accessing the llmclient for this is possible for external projects in the latest release, as that example takes advantage of being inside the project. (At least working through the quick start guide).
  
  
  hackgician 10 months ago
  
  We accidentally didn't release the right types for LLMClient :/ However, if you set the version in package.json to "alpha", it will install what's on the main branch on GitHub, which should have the typing fix there
  
  
  fidotron 10 months ago
  
  Yeah I saw it was a recent change in your GitHub and was happily running your examples.
  To be honest I took about 2 minutes of playing around to get annoyed with the inaccuracies of the locally hosted model for that, so I get why you encourage the other approaches.
  

fbouvier 10 months ago

Hey Anirudh, Stagehand looks awesome, congrats. Really love the focus on making browser automations more resilient to DOM changes. The act, extract, and observe methods are super clean.

You might want to check out Lightpanda (https://github.com/lightpanda-io/browser). It's an open-source, lightweight headless browser built from scratch for AI and web automation. It's focused on skipping graphical rendering to make it faster and lighter than Chrome headless.


qeternity 10 months ago

I don't really follow: a lot of the fragility of web automation comes from the programmatic vs. visual differences, which VLMs are able to overcome. Skipping the graphical rendering seems to be committing yourself to non-visual hell.
The web isn't made for agents and automation. It's made for people.

- hackgician 10 months ago
  
  Yes and no. Getting a VLM to work on the web would definitely be great, but it comes with its own problems, mainly around developing and acting on bounding boxes. We have vision as a default fallback for Stagehand, but we've found that the screenshot sent to the VLM often has to have pre-labeled elements on it. More notably, the screenshot with everything prelabeled leads to a cluttered and unusable image to process. Not pre-labeling runs the risk of missing important elements. I imagine a happy medium where the DOM+a11y tree can be used for candidate generation to a VLM.
  Solely depending on a VLM is indeed reminiscent of how humans interact with the web, but when a model thrives with more data, why restrict the data sent to the model?
  
TheTaytay 10 months ago

Lightpanda does look promising, but this is an important note from the readme: " You should expect most websites to fail or crash."

- fbouvier 10 months ago
  
  You're absolutely right, the 'most websites will fail' note is there because we're still in development, and the browser doesn't yet handle the long tail of web APIs.
  That said, the architecture's coming together and the performance gains we're seeing make us excited about what's possible as we keep building. Feedback is very welcome, especially on what APIs you'd like to see us prioritize for specific workflows and use cases.
  

bluelightning2k 10 months ago

Does this open up the possibility of automating an existing open browser tab? (Instead of a headless or specifically opened instance of chrome?)


its_down_again 10 months ago

Have you looked into agentic chrome extensions like MultiOn? They use a similar class of AI model, but work on top of existing open browser tabs.

namanyayg 10 months ago

Afaik no. But if it's access to authenticated resources that you want, you can do so by copying over cookies.
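
As a rough illustration of the cookie approach: Playwright's `BrowserContext.addCookies` accepts objects with `name` and `value` plus either a `url` or a `domain` and `path`. The helper below is a hypothetical sketch (the name `cookiesFromHeader` is made up here) that turns a `Cookie` header string copied from a logged-in browser session into that shape.

```typescript
// Shape of the fields used here; Playwright's addCookies accepts these
// (name and value, plus domain and path).
interface CookieParam {
  name: string;
  value: string;
  domain: string;
  path: string;
}

// Hypothetical helper: parse "a=1; b=2" into cookie objects for one domain.
function cookiesFromHeader(header: string, domain: string, path = "/"): CookieParam[] {
  return header
    .split(";")
    .map((pair) => pair.trim())
    .filter((pair) => pair.indexOf("=") > 0) // skip empty or nameless fragments
    .map((pair) => {
      const eq = pair.indexOf("="); // split on the first "=" only
      return { name: pair.slice(0, eq), value: pair.slice(eq + 1), domain, path };
    });
}
```

Assuming the `context` from the post behaves like a regular Playwright `BrowserContext`, usage would be something like `await context.addCookies(cookiesFromHeader(copiedHeader, "example.com"))` before navigating to the authenticated page.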

- hackgician 10 months ago
  
  Yes^ this is what we suggest. Stagehand is meant to execute isolated tasks on browsers; we support using custom contexts (cookies) with the following command:
      npx create-browser-app --example persist-context
  

jerrygoyal 10 months ago

wow. It's like the Cursor vs VS Code movement but for browser automation and scraping. Kudos to the author. Are there any other similar tools?


andrethegiant 10 months ago

https://crawlspace.dev has a similar LLM-aware scraping where you can pass a Zod object and it’ll match the schema, but is available as a PaaS offering with queueing / concurrency / storage built in [disclaimer: I’m the founder]

hackgician 10 months ago

Thanks so much! Crawlspace is pretty sick too, as is Integuru. A lot of people have different takes here on the level of automation to leave up to the user. As a developer building for developers, I wanted to meet in the middle and build off an existing incumbent that most people are likely familiar with already

- insdev12 10 months ago
  
  Yea Integuru is pretty cool: https://github.com/Integuru-AI/Integuru
  

CyberDildonics 10 months ago

People must be excited for this since a lot of people are commenting for the first time in months or years to say how much they love it. Some people liked it so much they commented for the first time ever to say how great it is.


mkagenius 10 months ago

And here I was in trenches with 2 upvotes with Show HN for https://github.com/BandarLabs/clickclickclick

ramesh31 10 months ago

This is 100% the future of UI testing. The dream of BDD and Gherkin can be fully realized now that the actual test code writing/maintenance portion is completely taken care of.

- CyberDildonics 10 months ago
  
  This thing that was just released is the future of UI testing? I usually just use the UI to test it.
  
  
  ramesh31 10 months ago
  
  >This thing that was just released is the future of UI testing?
  The general idea of it, yes. No one will be writing selector based tests by hand anymore in a couple years.
  

vitalets 10 months ago

Looks interesting. I know about a similar project - https://zerostep.com. Is it basically the same?


jsdalton 10 months ago

Does it operate by translating your higher level AI methods into lower level Playwright methods, and if so is it possible to debug the actual methods those methods were translated to?

Also is there some level of deterministic behavior here or might every test run result in a different underlying command if your wording isn’t precise enough?


hackgician 10 months ago

It's a little hacky, but we have a method in the act() handler called performPlaywrightMethod that takes in a playwright method + xpath and executes the playwright method on the xpath. There's definitely a lot of room for improvement here, and we're working on making observe() fill those gaps. I think observe() aims to be like GitHub Copilot's gray suggested text that you can then confirm in a secondary step; whereas act() takes on a more agentic workflow that you let the underlying agent loop make decisions on your behalf


jameslk 10 months ago

Cool to see another open source AI browser testing project! There’s a couple of others I’ve heard of as well:

Skyvern: https://github.com/Skyvern-AI/skyvern

Shortest: https://github.com/anti-work/shortest

I’d love to hear what makes Stagehand different and pros/ cons.

Of course, I have no complaints to see more competition and open source work in this space. Keep up the great work!


hackgician 10 months ago

Yes! These are both phenomenal projects, and kudos to their authors as well. Stagehand is different in that it makes fine-grained control a first-class citizen. Oftentimes, you want to control the exact steps a web agent takes. Our experience with other tools was that the only control you have over these steps is in the natural language prompt.
However with Stagehand, because it's an extension of Playwright, it allows you to confirm each step of the underlying agent's workflow, making it the most customizable option for engineers who want/need that


righthand 10 months ago

I’m curious how this compares to Playwright's built-in codegen:

https://playwright.dev/docs/codegen-intro

Is it easier to iterate on a test with a chatbot?


hackgician 10 months ago

Playwright codegen is incredibly powerful, but still pretty brittle. Its DOM selectors are still hardcoded, so you run the risk of Playwright selecting an unsustainable DOM selector. With Stagehand, the code is self-healing since it's dynamically generating Playwright every time, making it much more resilient to minor DOM changes

- kevmo314 10 months ago
  
  How do you avoid this becoming horrendously expensive per run? Are the results cached if the DOM doesn't change?
  
  
  hackgician 10 months ago
  
  The purpose of using Playwright is to basically write deterministic workflows in deterministic automation code. We have basic prompt caching right now that works if the DOM doesn't change (as you mention), but also the best way to reduce token cost is to reduce reliance on AI itself. You have the most control over how much you want to rely on AI v. how much you want to write repeatable Playwright code.
  

BrandiATMuhkuh 10 months ago

Congratulations. This is super cool.

I often thought E2E testing should be done with AI. What I want is that the functionality works (e.g.: login, then start an assignment) without the need to change the test each time the UI changes.


hackgician 10 months ago

Thanks so much! Sounds like Stagehand is a perfect fit; would love to hear your thoughts :)


fasten 10 months ago

cool extension to playwright! how effective are the ai methods in handling dynamic ui changes?


owebmaster 10 months ago

Any attempt on doing something similar but as a browser extension?


hackgician 10 months ago

That's definitely compelling, but not something we have in mind for the immediate future. Let me know if you end up building something here!

- temuze 10 months ago
  
  I'm currently working on it :)
  See you in two weeks I hope
  
  
  owebmaster 10 months ago
  
  anything we can see already?
  

arvindsubram 10 months ago

The easiest way to programmatically browse the web!!


hackgician 10 months ago

Thanks so much!! Appreciate the feedback


fredtalty5 10 months ago

[dead]


TibbityFlanders 10 months ago

[dead]
