Comment by bufferoverflow 19 hours ago
How so? All the models they tested are obsolete, multiple generations behind the current high-end versions.
(Though even these obsolete models outperformed the best humans and domain experts.)
As I wrote, the main point of the paper was not the specific model evaluation, but the development of a benchmark that can be used to test new models.
Good benchmark development is hard work. The paper goes into the details of how it was carried out.
Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end models and measure how performance has changed over time.
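To make that concrete, here's a minimal sketch of what "re-running the benchmark" typically looks like in code. Everything in it is hypothetical: the item schema, the model names, and `query_model()` are stand-ins, not anything from the paper's actual artifacts.

```python
# Hypothetical sketch of a benchmark evaluation loop.
# The schema, model names, and query_model() are placeholders,
# not the paper's real format or API.

def query_model(model_name: str, question: str) -> str:
    """Placeholder: swap in a real call (API, local weights) to the model under test."""
    return "42"  # stub so the sketch runs end to end

def evaluate(model_name: str, items: list[dict]) -> float:
    """Exact-match accuracy over {question, answer} items.
    Real benchmarks often use more careful scoring (normalization, judges, etc.)."""
    correct = sum(
        query_model(model_name, item["question"]).strip() == item["answer"]
        for item in items
    )
    return correct / len(items)

# In practice you'd load the published benchmark file here.
items = [
    {"question": "What is 6 * 7?", "answer": "42"},
    {"question": "Capital of France?", "answer": "Paris"},
]

# Re-running the same frozen benchmark against successive model generations
# is what makes "how has performance changed over time?" measurable.
for model in ["older-model", "current-high-end-model"]:
    print(f"{model}: {evaluate(model, items):.2f}")
```

The key point is that the item set stays frozen while only the model changes, so scores across generations are directly comparable.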
You could also use their paper to help understand how to develop a new benchmark, perhaps one that overcomes some of the existing one's limitations.
Neither the benchmark nor the contents of the paper will be obsolete until someone produces a better benchmark, along with a better account of how to build one.