Comment by amelius
Step 1. Train a VLM to supervise the RL training.
Step 2. Train the RL network. In the meantime, drink coffee or work on a plan for world domination.
My understanding is that this is essentially how RLHF works, and it doesn't scale. As you run RL for longer, the model learns to exploit the imperfections of the grader instead of getting better at the task at hand. To scale RL, therefore, you really need good graders, and determinism is king.
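A toy sketch of the failure mode, with hypothetical names and made-up numbers chosen purely for illustration: an imperfect learned grader partially tracks true task quality but also rewards a superficial feature, so a policy that maximizes the grader's score picks the degenerate action, while a deterministic ground-truth check does not.

```python
# Hypothetical candidate outputs: each has a true quality and a
# superficial feature the imperfect grader over-weights.
actions = {
    "solve_task":   {"true_quality": 1.0, "keyword_count": 1},
    "keyword_spam": {"true_quality": 0.0, "keyword_count": 10},
}

def imperfect_grader(a):
    # Learned-grader proxy: partially tracks quality, but also rewards
    # a superficial feature -- the exploitable imperfection.
    return 0.5 * actions[a]["true_quality"] + 0.2 * actions[a]["keyword_count"]

def deterministic_grader(a):
    # Ground-truth check: only true quality counts.
    return actions[a]["true_quality"]

def best_action(grader):
    # Stand-in for RL run long enough to maximize the grader's score.
    return max(actions, key=grader)

print(best_action(imperfect_grader))      # -> keyword_spam (reward hacking)
print(best_action(deterministic_grader))  # -> solve_task
```

The longer the optimization runs, the more reliably it finds whatever gap exists between the grader's score and the actual task objective.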