Comment by kh9000
Using the UIA tree as the currency for LLMs to reason over has always made more sense to me than computer-vision, screenshot-based approaches. It’s true that not all software exposes itself correctly via UIA, but almost all of the important stuff does. VS Code is one notable exception (though you can turn on accessibility support in the settings).
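For concreteness, here’s a minimal sketch of what that looks like, using Python with pywinauto’s "uia" backend; the helper name and the choice of window are just for illustration, not a real tool:

```python
# Minimal sketch: flatten a window's UIA tree into indented text an LLM can read.
# Assumes Windows with pywinauto installed; picking windows()[0] is arbitrary.
from pywinauto import Desktop

def dump_uia_tree(node, depth=0, lines=None):
    """Recursively flatten a UIA subtree into indented "control_type | name" lines."""
    if lines is None:
        lines = []
    info = node.element_info  # exposes control_type, name, rectangle, ...
    lines.append("  " * depth + f"{info.control_type} | name={info.name!r}")
    for child in node.children():
        dump_uia_tree(child, depth + 1, lines)
    return lines

if __name__ == "__main__":
    top = Desktop(backend="uia").windows()[0]  # a real tool would target a specific app
    print("\n".join(dump_uia_tree(top)))
```

The resulting text dump (control types, names, hierarchy) is the kind of unambiguous, actionable context the comment is talking about.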
“Important” is subjective. In the healthcare space, I’d claim that most applications don’t expose themselves correctly (native or web).
CV and direct mouse/keyboard interactions are the “base” interface, so if you solve that problem, you unlock just about every automation use case.
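As a rough illustration of that “base” path, a sketch using Python with pyautogui (template matching on a screenshot, then direct mouse/keyboard input); the "save_button.png" template and the typed filename are hypothetical:

```python
# Minimal sketch of the CV + direct input path: find a button by matching a
# template image against a screenshot, click it, then type.
# confidence= requires opencv-python to be installed.
import pyautogui

def click_template(template_path, confidence=0.8):
    """Locate a UI element by image matching and click its center, if found."""
    try:
        point = pyautogui.locateCenterOnScreen(template_path, confidence=confidence)
    except pyautogui.ImageNotFoundException:
        point = None
    if point is None:
        raise RuntimeError(f"template {template_path!r} not found on screen")
    pyautogui.click(point.x, point.y)

click_template("save_button.png")             # hypothetical template image
pyautogui.write("report.txt", interval=0.05)  # type a filename
pyautogui.press("enter")
```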
(I agree that if you can get good, unambiguous, actionable context from accessibility/automation trees, that’s going to be superior)