Comment by aoeusnth1

Definitely a good question. Using an actual LLM as the execution layer makes it easier to hand off to the planner agent when a test needs to be adapted. We don’t want to store just a selector-based test, because it’s hard to tell when such a test needs adaptation, and it’s inherently more brittle to subtle UI changes. We think a tiny model like Moondream makes this cheap enough that these benefits outweigh the alternative of caching actual Playwright code.
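
To make the handoff concrete, here is a minimal sketch of the control flow described above. It is not the actual implementation. Every name in it (`StepResult`, `run_test`, `execute`, `replan`, `max_replans`) is hypothetical: `execute` stands in for the small-model execution layer running one natural-language step, and `replan` stands in for the planner agent rewriting the steps from the failing one onward.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    ok: bool
    detail: str = ""  # what the executor observed when the step failed


# Hypothetical interfaces, assumed for illustration only.
Executor = Callable[[str], StepResult]               # runs one natural-language step
Planner = Callable[[List[str], int, str], List[str]]  # returns replacement steps from index i on


def run_test(steps: List[str], execute: Executor, replan: Planner,
             max_replans: int = 1) -> bool:
    """Run steps with the cheap executor; hand off to the planner on failure."""
    steps = list(steps)
    replans = 0
    i = 0
    while i < len(steps):
        result = execute(steps[i])
        if result.ok:
            i += 1
            continue
        if replans >= max_replans:
            return False  # give up: the test fails for real
        # Keep the steps that already passed, replace the rest with the new plan,
        # then retry from the same index.
        steps = steps[:i] + replan(steps, i, result.detail)
        replans += 1
    return True
```

The point of the sketch is that a failed step carries a description of what went wrong, which gives the planner something to adapt from. A cached selector script only reports that a selector did not match.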