Comment by DoctorOetker 3 days ago

It is very sad that there is so much gaming of metrics with LLMs.

If we wish to avoid everyone creating benchmarks for themselves, then instead of predetermined benchmarks (public ones allow gaming, while publicly scored private ones require blind trust), we could run gradient descent on sentences to find disagreements between models, and then present those to human domain experts.
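Concretely, something in the spirit of soft-prompt / adversarial-prompt optimisation could do this. The sketch below is only an illustration of the idea, not a spec: the model pair (gpt2 / distilgpt2 as stand-ins sharing one vocabulary), the seed prompt, the soft-token relaxation, and the symmetric-KL "disagreement" score are all assumptions of mine. It relaxes the sentence to differentiable soft tokens, runs gradient ascent on how far the two models' next-token distributions diverge, and snaps the result back to readable tokens for the human expert.

```python
# Minimal sketch: gradient ascent on a "sentence" to maximise disagreement
# between two frozen LMs. Models, prompt and loss are illustrative choices.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # both models share this vocabulary
model_a = AutoModelForCausalLM.from_pretrained("gpt2").eval()
model_b = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
for m in (model_a, model_b):
    for p in m.parameters():
        p.requires_grad_(False)                      # only the input sentence is optimised

# Relax the sentence to a soft distribution over the vocabulary,
# initialised near a one-hot encoding of a seed prompt.
seed = tok("The standard treatment for this condition is", return_tensors="pt").input_ids[0]
sent_logits = torch.full((len(seed), len(tok)), -5.0)
sent_logits[torch.arange(len(seed)), seed] = 5.0
sent_logits.requires_grad_(True)

opt = torch.optim.Adam([sent_logits], lr=0.1)
for step in range(200):
    opt.zero_grad()
    soft = F.softmax(sent_logits, dim=-1)            # (seq_len, vocab)
    # Each model reads the soft tokens through its own embedding matrix.
    emb_a = soft @ model_a.get_input_embeddings().weight
    emb_b = soft @ model_b.get_input_embeddings().weight
    logp_a = F.log_softmax(model_a(inputs_embeds=emb_a.unsqueeze(0)).logits[:, -1], dim=-1)
    logp_b = F.log_softmax(model_b(inputs_embeds=emb_b.unsqueeze(0)).logits[:, -1], dim=-1)
    # Symmetric KL of the next-token predictions = how much the models disagree.
    disagreement = (F.kl_div(logp_a, logp_b, reduction="batchmean", log_target=True)
                    + F.kl_div(logp_b, logp_a, reduction="batchmean", log_target=True))
    (-disagreement).backward()                       # gradient *ascent* on disagreement
    opt.step()

# Snap back to discrete tokens so a human domain expert can read, and judge,
# the prompt on which the two models diverge most.
print(tok.decode(sent_logits.argmax(dim=-1)))
```

Snapping the optimised soft tokens back to discrete ones is lossy, so the disagreement would have to be re-measured on the decoded sentence before it is queued for expert review.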

At least it could be public without the possibility of leaking, since the model creators don't yet know all the possible disagreements between LLMs, nor which ones will be selected for review by human experts.