Hi HN, I'm the creator of Android Use.

I built this tool to enable natural language-driven automation on Android devices.
While many existing agents rely heavily on vision (taking screenshots and analyzing pixels), I took a different approach: XML parsing.
By analyzing the UI hierarchy directly via XML, the agent can:

- Achieve precise positioning and interaction, clicking by element index rather than guessing pixel coordinates (see the sketch below).

- Run faster and more efficiently than screenshot-based pipelines, since there's no image to capture or analyze.

- Work with LLMs that don't have vision capabilities, including cheaper text-only models like DeepSeek, Kimi, and Qwen.
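To make the approach concrete, here's a minimal sketch of the core loop using plain adb and uiautomator. This is illustrative rather than the actual implementation, and the helper names (dump_hierarchy, clickable_elements, tap) are mine:

    import re
    import subprocess
    import xml.etree.ElementTree as ET

    def dump_hierarchy() -> ET.Element:
        # Ask uiautomator to dump the current screen's view tree, then read it back.
        subprocess.run(
            ["adb", "shell", "uiautomator", "dump", "/sdcard/window_dump.xml"],
            check=True,
        )
        xml_text = subprocess.run(
            ["adb", "shell", "cat", "/sdcard/window_dump.xml"],
            check=True, capture_output=True, text=True,
        ).stdout
        return ET.fromstring(xml_text)

    def clickable_elements(root: ET.Element) -> list[dict]:
        # Flatten the tree into an indexed list of clickable nodes for the LLM prompt.
        elements = []
        for node in root.iter("node"):
            if node.get("clickable") == "true":
                # bounds are stored as "[x1,y1][x2,y2]"
                x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", node.get("bounds", "")))
                elements.append({
                    "index": len(elements),
                    "label": node.get("text") or node.get("content-desc") or "",
                    "center": ((x1 + x2) // 2, (y1 + y2) // 2),
                })
        return elements

    def tap(elements: list[dict], index: int) -> None:
        # The model replies with an index; tap the center of that element's bounds.
        x, y = elements[index]["center"]
        subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

Because the model only ever sees the numbered text labels, any text-only LLM can drive the device by answering with an index.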
It's open source, and I'd love to hear your feedback or answer any questions about the implementation!