Comment by eesmith
How so?
To me it looks like the paper was submitted last year, but the peer reviewers identified issues that required revision before the final acceptance in March.
We can see the paper was updated since the 1 April 2024 version, as it includes o1-preview (released September 2024, I believe) and GPT‑3.5 Turbo from August. I think a couple of the other tested versions also post-date 1 April.
Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems, and didn't reflect the full diversity of available tools.
In any case, the main point of the paper was not the specific results for the AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.
How has that work been made obsolete?
How so? All the models they've tested are obsolete, multiple generations behind high-end versions.
(Though even these obsolete models performed better than the best humans and domain experts.)