Windows-Use: an AI agent that interacts with Windows at GUI layer

kh9000 10 hours ago

Using the UIA tree as the currency for LLMs to reason over always made more sense to me than computer-vision, screenshot-based approaches. It’s true that not all software exposes itself correctly via UIA, but almost all the important stuff does. VS Code is one notable exception (but you can turn on accessibility support in the settings).
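
As a rough illustration of what "reasoning over the tree" could look like, here is a minimal sketch that flattens an accessibility tree into indented text suitable for a prompt. The `UINode` class and its fields are hypothetical stand-ins for whatever a UIA client would return; no real UIA calls are made, and this is not taken from the Windows-Use code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UINode:
    """Hypothetical stand-in for one element of a UIA tree."""
    control_type: str
    name: str = ""
    children: list[UINode] = field(default_factory=list)


def serialize(node: UINode, depth: int = 0) -> str:
    """Render the tree as one indented line per element."""
    label = f'{node.control_type} "{node.name}"' if node.name else node.control_type
    lines = ["  " * depth + label]
    for child in node.children:
        lines.append(serialize(child, depth + 1))
    return "\n".join(lines)


# A made-up tree, loosely shaped like a Notepad window.
window = UINode("Window", "Untitled - Notepad", [
    UINode("MenuBar", children=[UINode("MenuItem", "File"), UINode("MenuItem", "Edit")]),
    UINode("Document", "Text editor"),
])
print(serialize(window))
```

The text form is far smaller than a screenshot and names each control explicitly, which is the appeal of the approach; it only helps when the application actually populates the tree.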


philipbjorge 7 hours ago

"Important" is subjective. In the healthcare space, I’d make the claim that most applications don’t expose themselves correctly (native or web).
CV and direct mouse/kb interactions are the “base” interface, so if you solve this problem, you unlock just about every automation use case.
(I agree that if you can get good, unambiguous, actionable context from accessibility/automation trees, that’s going to be superior.)

freedomben 9 hours ago

Agreed. I've noticed that when ChatGPT parses screenshots, it writes Python code to do the parsing, and at least in the tests I've done (with things like "what is the RGB value of the bullet points in the list") it ends up writing and rewriting the script five or so times and then gives up. I haven't tried others, so I don't know whether this approach is unique to ChatGPT, but it definitely feels fragile and slow to me.

- Juminuvi 3 hours ago
  
  I noticed something similar. I asked it to extract a GUID from an image and it wrote a Python script to run OCR against it... and got it wrong. Prompting a bit more seemed to finally trigger it to use its native image analysis, but I'm not sure what the trick was.
  
  
  morkalork an hour ago
  
  I've run into this with uploading audio and text files. I have to yell at it to not write any code and to use its native abilities to do the job.
  
akurilin 7 hours ago

I recently tried using Qwen VL or Moondream to see if off-the-shelf they would be able to accurately detect most of the interesting UI elements on the screen, either in the browser or your average desktop app.
It was a somewhat naive attempt, but they didn't seem to perform well out of the box, and getting there would probably take a lot of additional work. I wonder if there are models that do much better, maybe whatever OpenAI uses internally for Operator, but I'm not clear how bulletproof that one is either.
These models weren't trained specifically for UI object detection and grounding, so, it's plausible that if they were trained on just UI long enough, they would actually be quite good. Curious if others have insight into this.

nikanj 7 hours ago

Most Electron software doesn't follow accessibility guidelines and exposes nothing over UIA


philfreo 10 hours ago

Cool. Reminds me of using SendKeys() in Visual Basic 6 in the 90s

https://learn.microsoft.com/en-us/dotnet/api/microsoft.visua...


sebastiennight 6 hours ago

I loved SendKeys()!
Used it to write programs that would run in the background & spook my friends by "typing" quotes from movies at random times on their computer.

halfcat 2 hours ago

SendKeys() in VB powered basically all of the AOL chat bots in the 90s.
It’s how I accidentally learned the Win32 API.

anthk 7 hours ago

And BeOS/Haiku has the "Hey" command, which does much the same thing but goes well beyond key input: you can interact with widgets too. Under Unix, there's xdotool and friends.
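
For the xdotool case, a minimal sketch of a SendKeys-style wrapper might look like the following. The helper names are hypothetical; the only real interface assumed is the `xdotool type --delay <ms> <text>` and `xdotool key <keys>` command forms, and nothing is executed unless `run` is called on a machine with xdotool installed and an X session available.

```python
import shutil
import subprocess


def xdotool_type_args(text: str, delay_ms: int = 50) -> list[str]:
    """Build the argv for `xdotool type`, which types text into the focused window."""
    return ["xdotool", "type", "--delay", str(delay_ms), text]


def xdotool_key_args(*keys: str) -> list[str]:
    """Build the argv for `xdotool key`, e.g. xdotool_key_args("ctrl+s")."""
    return ["xdotool", "key", *keys]


def run(args: list[str]) -> bool:
    """Run the command only if the program is installed; return whether it ran."""
    if shutil.which(args[0]) is None:
        return False
    subprocess.run(args, check=True)
    return True
```

Calling `run(xdotool_type_args("Hello World"))` would type into whichever window has focus, which is the same blind, focus-dependent behavior that made SendKeys() both fun and fragile.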


mtVessel 8 hours ago

I feel vaguely vindicated that the agent can't figure out how to use the modern Save As workflow, either, and reverts to the traditional dialog.


electroly 9 hours ago

Looks awesome. I've attempted my own implementation, but I never got it to work particularly well. "Open Notepad and type Hello World" was a triumph for me. I landed on the UIA tree + annotated screenshot combination, too, but mine was too primitive, and I tried to use GPT, which isn't as good at image tasks as Gemini, the model used here. Great job!
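
One way the UIA tree + annotated screenshot combination can fit together is sketched below, under stated assumptions: each interactive element gets a numeric label (the number that would be drawn on the screenshot), the model answers with a label, and the label is resolved back to a click point. The `Element` class, its fields, and the helper names are hypothetical, and this is not the Windows-Use implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical interactive element with a screen-space bounding box."""
    name: str
    left: int
    top: int
    right: int
    bottom: int


def label_elements(elements: list[Element]) -> dict[int, Element]:
    """Assign each element the numeric label that would be drawn on the screenshot."""
    return {i: el for i, el in enumerate(elements, start=1)}


def prompt_lines(labels: dict[int, Element]) -> str:
    """List the labels as text so the model can pair them with the image."""
    return "\n".join(f"[{i}] {el.name}" for i, el in labels.items())


def click_point(labels: dict[int, Element], chosen: int) -> tuple[int, int]:
    """Resolve the label chosen by the model to the center of that element."""
    el = labels[chosen]
    return ((el.left + el.right) // 2, (el.top + el.bottom) // 2)


labels = label_elements([
    Element("File", 0, 0, 40, 20),
    Element("Edit", 40, 0, 80, 20),
])
print(prompt_lines(labels))
print(click_point(labels, 2))
```

The design point is that the model never has to guess pixel coordinates: it picks a label, and the coordinates come from the tree.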


yodon 10 hours ago

Very cool - does anyone know of an OSX equivalent?

Preferably one that is similarly able to understand and interact with web page elements, in addition to app elements and system elements.


CharlesW 10 hours ago

There are MCPs that work with the macOS Accessibility stack, like https://github.com/steipete/macos-automator-mcp, https://github.com/ashwwwin/automation-mcp, https://github.com/mediar-ai/MacosUseSDK, and https://github.com/baryhuang/mcp-remote-macos-use.
For web page elements, you could drive the browser via `do JavaScript` or use a dedicated browser MCP (Chrome DevTools MCP, Playwright MCP).


dvt 5 hours ago

Working on something very similar in Rust. It's quite magical when it works (that's a big caveat, as I'm trying to make it work with local LLMs). Very cool implementation, and imo, this is the future of computing.


AfterHIA 5 hours ago

I remember an older friend asking me recently, "Will there be a thing soon where I can make my computer go on autopilot?"

I guess I can answer, "Yes, I think so."


KaseKun 5 hours ago

Can it farm a ber rune for me?


alexchantavy 3 hours ago

Yeah, computer-use agents remind me of game automators from back in the day, such as RuneScape autoclickers like SCAR. I posted on this a while back, haha: https://news.ycombinator.com/item?id=29716900#29720860


tiahura 8 hours ago

LLMs do a pretty good job of using pywin32 for programs that support COM, like Office.
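
For reference, the kind of script this refers to might look like the sketch below, which assumes Windows, an installed copy of Excel, and the pywin32 package. The function name is hypothetical; the import is done lazily so the file can at least be loaded on other platforms.

```python
def write_hello_to_excel(text: str = "Hello from COM") -> None:
    """Start Excel through COM and write `text` into cell A1 of a new workbook."""
    # Imported lazily because pywin32 is only available on Windows.
    import win32com.client

    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = True            # show the window so the result can be inspected
    workbook = excel.Workbooks.Add()
    sheet = workbook.Worksheets(1)
    sheet.Cells(1, 1).Value = text  # row 1, column 1 is cell A1
```

Calling `write_hello_to_excel()` on a suitable machine should leave a visible workbook with the text in A1. Because COM exposes the application's object model directly, no screenshots or UI trees are involved at all.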
