Comment by bufferoverflow 19 hours ago
How so? All the models they tested are obsolete, multiple generations behind the current high-end versions.
(Though even these obsolete models outperformed the best humans and domain experts.)
As I wrote, the main point of the paper was not the specific model evaluation, but the development of a benchmark that can be used to test new models.
Good benchmark development is hard work. The paper goes into the details of how it was carried out.
Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end models and measure how performance has changed over time.
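To make that concrete, here's a minimal sketch of what "re-running the benchmark" typically looks like in code. Everything in it is hypothetical: the item schema, the model names, and `query_model()` are stand-ins, not anything from the paper's actual artifacts.

```python
# Hypothetical sketch of a benchmark evaluation loop.
# The schema, model names, and query_model() are placeholders,
# not the paper's real format or API.

def query_model(model_name: str, question: str) -> str:
    """Placeholder: swap in a real call (API, local weights) to the model under test."""
    return "42"  # stub so the sketch runs end to end

def evaluate(model_name: str, items: list[dict]) -> float:
    """Exact-match accuracy over {question, answer} items.
    Real benchmarks often use more careful scoring (normalization, judges, etc.)."""
    correct = sum(
        query_model(model_name, item["question"]).strip() == item["answer"]
        for item in items
    )
    return correct / len(items)

# In practice you'd load the published benchmark file here.
items = [
    {"question": "What is 6 * 7?", "answer": "42"},
    {"question": "Capital of France?", "answer": "Paris"},
]

# Re-running the same frozen benchmark against successive model generations
# is what makes "how has performance changed over time?" measurable.
for model in ["older-model", "current-high-end-model"]:
    print(f"{model}: {evaluate(model, items):.2f}")
```

The key point is that the item set stays frozen while only the model changes, so scores across generations are directly comparable.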
You could also use their paper to help understand how to develop a new benchmark, perhaps one that overcomes some of the existing one's limitations.
Neither the benchmark nor the contents of the paper will be obsolete until someone produces a better benchmark, along with a better account of how to build one.