Comment by up6w6

I am very suspicious of the results. A few months ago they published an LLM benchmark and called it "perfect", even though it contained only about 50 inputs (academic benchmark datasets usually contain tens of thousands of inputs).