Comment by frabonacci 21 hours ago
Thanks, really appreciate it!
The LLM interacts with the VM through a structured virtual computer interface (cua-computer and cua-agent). It’s a high-level abstraction that lets the agent act (e.g., “open Terminal”, “type a command”, “focus an app”) and observe (e.g., current window, file system, OCR of the screen, active processes) in a way that feels a lot more like using a real computer than parsing raw data.
So under the hood, yes, screenshots and metadata are used (especially with the Omni loop and visual grounding), but what the model sees is a clean interface designed for agentic workflows - closer to how a human would think about using a computer.
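Roughly, the computer side of that interface looks something like the sketch below - names and parameters are illustrative, not the exact cua-computer API, so treat the repo docs as the source of truth:

    import asyncio
    from computer import Computer  # cua-computer (illustrative usage)

    async def main():
        # Boot / attach to the sandboxed VM (constructor args are assumed here)
        async with Computer(os="macos") as computer:
            # "Act": drive the UI the way a person would
            await computer.interface.type_text("open -a Terminal")
            await computer.interface.press_key("enter")

            # "Observe": structured state rather than raw pixels alone
            screenshot = await computer.interface.screenshot()
            # ...plus window info, OCR, processes, etc., depending on the loop

    asyncio.run(main())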
If you're curious, the agent loops (OpenAI, Anthropic, Omni, UI-Tars) offer different ways of reasoning and grounding actions, depending on whether you're using cloud or local models.
https://github.com/trycua/cua/tree/main/libs/agent#agent-loo...
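Choosing a loop is mostly a configuration choice on the agent. A rough, self-contained sketch (again with approximate names, not the literal cua-agent API):

    import asyncio
    from computer import Computer
    from agent import ComputerAgent, AgentLoop, LLM, LLMProvider  # assumed names

    async def main():
        async with Computer(os="macos") as computer:
            # The loop controls how actions are reasoned about and grounded:
            # OPENAI / ANTHROPIC lean on those providers' computer-use models,
            # OMNI adds visual grounding and also works with local models,
            # UITARS targets the UI-TARS family.
            agent = ComputerAgent(
                computer=computer,
                loop=AgentLoop.OMNI,
                model=LLM(provider=LLMProvider.OPENAI, name="gpt-4o"),
            )
            async for result in agent.run("Open Terminal and list the home directory"):
                print(result)

    asyncio.run(main())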
First off - this is great, and I think there are use cases for this. Being able to even partially isolate these agents could be helpful.
Second, as a user, you’d want to handle the case where some or all of these have been fully compromised. Surreptitiously, super-intelligently, and partially or fully autonomously, one container or many could gain access to otherwise isolated networks within homes or corporations, or to some device in a high-security area with access to nuclear weapons, biological weapons, the electrical grid, our water supply, our food supply, manufacturing, or even some other key vulnerability we’ve discounted, like a toy.
While providing more isolation is good, there is no amount of caution that can prevent calamity when you give everyone a Pandora’s box. It’s like giving someone a bulletproof jacket to protect them from fox tapeworm cancer or hyper-intelligent, time-traveling, timespace-manipulating super-Ebola.
That said, it’s the world we live in now, where we’re in a race to our demise. So, thanks for the bulletproof jacket.