Comment by janalsncm

> nothing about saying they use rl implies they use mcts

We can say the same thing about RL implying PPO. However, there are some pretty big hints, namely Noam Brown being involved. Much of what Noam Brown has worked on involves RL in tree search contexts.

He has also consistently advocated using additional test-time compute to solve search problems, which is consistent with the messaging around the reasoning tokens. There is likely some form of learned tree search involved, for example search guided by a learned policy/value function as in AlphaGo.

It’s all speculation until we have an actual paper, so we can’t categorically say MCTS/learned tree search isn’t involved.