Comment by blackbear_
Comment by blackbear_ 4 days ago
They mention reinforcement learning, so I guess they used some sort of Monte Carlo tree search (the same algorithm used for AlphaGo).
In this case, the model would explore several chain of thoughts during training, but only output a single chain during inference (as the sibling comment suggests).
as someone who works in this field, this comment is obviously uninformed even about old public research trends