Comment by whistle650 a day ago
It seems they use 70% of the benchmark query-answer pairs to cluster and determine which models work best for each cluster (by sending all queries to all models and looking at responses vs ground truth answers). Then they route the remaining 30% "test" set queries according to those prior determinations. It doesn't seem surprising that this approach would give you Pareto efficiency on those benchmarks.
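To make that concrete, here is a rough sketch of the procedure as I understand it. It is not their code: the 2-D toy "embeddings", the model names `small` and `large`, the tiny k-means, and the helper names `fit_router` and `route` are all made up for illustration.

```python
import random
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means for illustration; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(dim) / len(groups[i]) for dim in zip(*groups[i]))
            if groups[i] else centroids[i]
            for i in range(k)
        ]
    return centroids

def fit_router(train, k):
    """train: list of (embedding, {model_name: is_correct}) pairs.

    Clusters the training queries, then picks the model with the most
    correct answers in each cluster.
    """
    centroids = kmeans([emb for emb, _ in train], k)
    wins = defaultdict(lambda: defaultdict(int))
    for emb, results in train:
        cluster = nearest(emb, centroids)
        for model, ok in results.items():
            wins[cluster][model] += int(ok)
    best = {c: max(scores, key=scores.get) for c, scores in wins.items()}
    return centroids, best

def route(emb, centroids, best, default):
    """Send a query to the model that did best on its nearest cluster."""
    return best.get(nearest(emb, centroids), default)

# Toy benchmark (assumption): two query types, each answered correctly by one model only.
rng = random.Random(42)
data = []
for _ in range(50):
    data.append(((rng.gauss(0, 1), rng.gauss(0, 1)), {"small": True, "large": False}))
    data.append(((rng.gauss(10, 1), rng.gauss(10, 1)), {"small": False, "large": True}))
rng.shuffle(data)

# 70% to fit the router, 30% held out as the "test" queries.
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

centroids, best = fit_router(train, k=2)
accuracy = sum(results[route(emb, centroids, best, "large")] for emb, results in test) / len(test)
print(f"routed accuracy on held-out 30%: {accuracy:.2f}")
```

The held-out queries are drawn from the same distribution the router was fit on, which is why the routed result looks good in a setup like this.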
That's fine if you can update the router over time; the more data you have, the better.