Comment by Straw
This post mischaracterizes AlphaZero/MuZero.
AlphaZero/MuZero are not model-free, and they aren't on-policy either. They train at a significantly higher temperature, and thus with intentionally suboptimal play, than they use when playing to win. LeelaChessZero has further improvements to reduce the bias from training on suboptimal play.
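For concreteness, here's a minimal sketch (illustrative Python, names are my own, not from any of these codebases) of the temperature idea: self-play samples moves in proportion to MCTS visit counts raised to 1/temperature, while play-to-win just takes the most visited move.

```python
import numpy as np

def select_move(visit_counts, temperature):
    """Pick a move from MCTS visit counts.

    temperature > 0: sample in proportion to N(a)^(1/T) (exploratory,
    deliberately suboptimal play used to generate training games).
    temperature -> 0: play the most visited move (playing to win).
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature < 1e-3:          # effectively greedy
        return int(np.argmax(counts))
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(counts), p=probs))

counts = [120, 45, 30, 5]
train_move = select_move(counts, temperature=1.0)  # varied, suboptimal
eval_move  = select_move(counts, temperature=0.0)  # best-searched move
```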
There's a well-known tradeoff in TD learning based on how many steps ahead you look: 1-step TD converges off-policy, but can give you total nonsense/high bias when your Q function isn't trained. Many-step TD can't give you nonsense because it scores against the real result, but that real result came from suboptimal play, so the off-policy data biases the results, and it's higher variance. It's not hard to adjust between these two in AlphaZero training as you progress to minimize overall bias/variance. (As in, AlphaZero can easily do this; I'm not saying tuning the schedule for how to do it is easy!)
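To make the tradeoff concrete, here's a rough sketch (illustrative Python, not AlphaZero's actual training code) of the two targets being contrasted, plus a mixing weight that a training schedule could move as the value estimate improves:

```python
def one_step_td_target(reward, next_value, gamma=1.0):
    # Bootstrapped target: low variance and converges off-policy,
    # but pure nonsense early on when the value/Q estimate is untrained.
    return reward + gamma * next_value

def monte_carlo_target(game_result):
    # Score against the real outcome: no bootstrap bias, but the outcome
    # came from high-temperature (suboptimal) play, so it is biased by the
    # behavior policy and has higher variance.
    return game_result

def mixed_target(reward, next_value, game_result, lam, gamma=1.0):
    # lam = 1: trust the real game result; lam = 0: pure bootstrap.
    # A schedule could decrease lam as the value net gets better.
    return ((1 - lam) * one_step_td_target(reward, next_value, gamma)
            + lam * monte_carlo_target(game_result))

# Early in training: lean on the real result.
early = mixed_target(reward=0.0, next_value=0.3, game_result=+1.0, lam=0.9)
# Later, once the value estimate is decent: lean on the bootstrap.
late  = mixed_target(reward=0.0, next_value=0.3, game_result=+1.0, lam=0.2)
```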
It is the weekend, so let’s anthropomorphize.
The idea of suboptimal play to improve learning is interesting. We can notice the human social phenomenon of being annoyed at players who play games “too mechanically” or boringly, and of admiring players (even in an overly studied game like chess) with an “intuitive” style.
I wonder how AI training strategies would change if the number of games they were allowed to play were fairly limited, say to the few thousand matches of a game that a person might play over a lifetime. And perhaps if their “rank” were evaluated over the course of their training, like it is for humans.