Comment by simonw
You mean that instead of actually running the code they are writing, they pretend to run it and the model shows what it thinks would happen?
I don't like that at all. Actually running the code is the single most effective protection we have against coding mistakes, from both humans and machines.
I think it's absolutely worth the complexity and performance overhead of hooking up a real container environment.
Not to mention you can run a useful code execution container in 100MB of RAM on a single CPU (or slice thereof). Simulating that with an LLM takes at least one GPU and 100GB or more of VRAM.
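To make the resource numbers concrete, here is a rough sketch of launching a capped container from Python. It assumes Docker is installed and uses the `python:3.12-slim` image as an example; the helper names `docker_run_args` and `run_in_container` are made up for illustration, and the limits simply mirror the 100MB / single CPU figures above.

```python
import subprocess


def docker_run_args(code, image="python:3.12-slim", memory="100m", cpus="1"):
    """Build a `docker run` command that executes `code` under tight limits.

    Hypothetical helper: the image name and default limits are assumptions.
    """
    return [
        "docker", "run", "--rm",   # remove the container once it exits
        "--memory", memory,        # hard cap on RAM
        "--cpus", cpus,            # cap on CPU share
        "--network", "none",       # no network access for the executed code
        image, "python", "-c", code,
    ]


def run_in_container(code, timeout=30):
    """Run `code` for real and return the CompletedProcess.

    Note: the timeout stops the docker client process; cleaning up a
    container that is still running is left out of this sketch.
    """
    return subprocess.run(
        docker_run_args(code),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Something like `run_in_container("print(2 + 2)").stdout` would then return what the code actually printed, rather than a model's guess at it.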
I understand your point, but I find myself running all my agents in barebones containers, and they’re short-run, make-or-kill types. Once we ramp up agent counts, possibly into the thousands, that overhead could add up rapidly. Of course, you would still run milestone tests in actual containers/environments, but I think there might be a need for lighter solutions for rapid agent dev runs.
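As a rough illustration of the short-run, make-or-kill pattern, the sketch below runs a snippet in a plain child process and kills it if it overruns a deadline. This is an assumption about what a lighter dev-run setup might look like, not a replacement for a container: a bare subprocess gives no isolation. The name `run_or_kill` and the result fields are hypothetical.

```python
import subprocess
import sys


def run_or_kill(code, timeout=5.0):
    """Run `code` in a fresh Python process; kill it if it exceeds `timeout`.

    Hypothetical helper for quick dev runs. No sandboxing is provided here.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "killed": True, "stdout": "", "stderr": ""}
    return {
        "ok": proc.returncode == 0,
        "killed": False,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

A run either finishes with a real exit code and real output or gets killed at the deadline, so the result still comes from actual execution even without the container around it.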