Comment by wsxiaoys

I've spent the last few months working on a custom RL model for coding tasks. The biggest headache has been the lack of good tooling for tuning the autorater's prompt. (That's the judge that gives the training feedback.) The process is like any other quality-focused task—running batch rating jobs and doing SxS evaluations—but the tooling really falls short. I think I'll have to build my own tools once I wrap up the current project