Comment by YetAnotherNick

RL is a training method, so it improves the model itself. Basically, the outcome of one step (e.g. a successful test run, or finding a search result) can create positive and negative examples for the other step (e.g. the coding agent or search agent). Trained on these, the base model itself will improve at satisfying those other demands, and if it gets close to 100% accuracy (which I believe it could, as models mostly fail due to dumb mistakes in tests), you don't need the testing agent at all.
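
To make the "one step creates examples for the other" idea concrete, here is a rough sketch of how a pass/fail check could label a coding agent's outputs. Everything in it is made up for illustration: `Example`, `label_candidates`, `preference_pairs` and `toy_check` are hypothetical names, not from any real RL library, and the actual training update is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    prompt: str
    completion: str
    reward: float  # 1.0 if the check passed, 0.0 otherwise


def label_candidates(
    prompt: str, candidates: List[str], check: Callable[[str], bool]
) -> List[Example]:
    """Turn the verifier step's pass/fail result into a reward per candidate."""
    return [Example(prompt, c, 1.0 if check(c) else 0.0) for c in candidates]


def preference_pairs(examples: List[Example]) -> List[Tuple[str, str, str]]:
    """Pair every passing completion with every failing one: (prompt, good, bad)."""
    positives = [e for e in examples if e.reward > 0]
    negatives = [e for e in examples if e.reward == 0]
    return [(p.prompt, p.completion, n.completion) for p in positives for n in negatives]


def toy_check(code: str) -> bool:
    """Hypothetical stand-in for a test run. exec() is unsafe on untrusted code;
    a real setup would run the tests in a sandbox."""
    namespace: dict = {}
    try:
        exec(code, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        # Syntax errors, missing names and wrong behaviour all count as failures.
        return False


# Hypothetical samples from a coding agent for one prompt.
candidates = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # runs, but wrong answer
    "def add(a, b) return a + b",        # syntax error
]

examples = label_candidates("Write add(a, b).", candidates, toy_check)
pairs = preference_pairs(examples)
print([e.reward for e in examples])
print(len(pairs))
```

The rewards or the (prompt, good, bad) pairs are what a training step would consume. Once the model rarely produces the failing kind, the check stops adding much at inference time, which is the point above.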