Comment by up6w6

Comment by up6w6 6 days ago

0 replies

I am very suspicious of the results. A few months ago they published a LLM benchmark, calling it "perfect" while it actually contained like only 50 inputs (academic benchmark datasets usually contain tens of thousands of inputs).