Comment by maxrmk
Very cool. Do you do anything to mitigate ordering bias in the evaluation function, or do you just expect it to average out over time?
No, we don't do anything to mitigate it. In theory, we could judge several times with different orderings.
We could measure order bias really easily though; we just need to look at the average score by rollout position across many runs. I'll add that to my list of experiments!
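For anyone curious, here is a rough sketch of what that measurement might look like. It assumes each run is logged as a list of judge scores in the order the rollouts were shown to the judge; the `runs` data and the `mean_score_by_position` helper are made up for illustration and are not from our codebase. A gap between positions only points to order bias if rollouts land in positions more or less at random.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_position(runs):
    """Average judge score for each presentation position.

    runs[i][j] is assumed to be the score the judge gave to the rollout
    shown at position j in run i (a hypothetical log format).
    """
    by_position = defaultdict(list)
    for scores in runs:
        for position, score in enumerate(scores):
            by_position[position].append(score)
    # Sort by position so the result reads first-to-last.
    return {pos: mean(vals) for pos, vals in sorted(by_position.items())}


# Made-up scores on a 0-10 scale, three rollouts per run.
runs = [
    [9, 5, 4],
    [7, 6, 2],
    [8, 4, 3],
]
position_means = mean_score_by_position(runs)
for pos, avg in position_means.items():
    print(f"position {pos}: mean score {avg}")
```

With enough runs, a flat profile across positions would suggest the judge is not favoring any slot, while a consistent slope would be worth a closer look.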