Comment by eesmith 20 hours ago

How so?

To me it looks like the paper was submitted last year but the peer reviewers identified issues with the paper which required revision before the final acceptance in March.

We can see the paper was updated since the 1 April 2024 version as it includes o1-preview (released September 2024, I believe), and GPT‑3.5 Turbo from August. I think a couple of other tested versions also post-date 1 April.

Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems, and didn't reflect the full diversity of available tools.

In any case, the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.

How has that work been made obsolete?

bufferoverflow 19 hours ago

How so? All the models they've tested are obsolete, multiple generations behind high-end versions.

(Though even these obsolete models did better than the best humans and domain experts).

  • eesmith 19 hours ago

    As I wrote, the main point of the paper was not the specific model evaluation, but the development of a benchmark which can be used to test new models.

    Good benchmark development is hard work. The paper goes into the details of how it was carried out.

    Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.

    You could also use their paper to help understand how to develop a new benchmark, perhaps one that overcomes some of this benchmark's limitations.

    That benchmark and the contents of that paper are not obsolete until there is a better benchmark and a better description of how to build one.