Comment by AIPedant

I was outraged to learn that the test is basically a combination of multiple-choice and true-false questions. What a misleading pile of crap this study is!

a) Doing well on multiple-choice tests does not really imply anything about doing well in a real lab setting, especially for an AI.

b) Using multiple-choice tests to compare LLMs to humans has an obvious flaw: LLMs are probably superhuman at guessing multiple-choice answers based on superficial statistics! We all learned as children that there are ways to game multiple-choice questions even if you have no idea what the answer is.

c) It is just unacceptably lazy for AI researchers to do this. They aren't teachers on a tight deadline. There is no justification whatsoever for AI researchers to use Scantrons.

I am truly disgusted with the poor scientific rigor in AI research. So depressing.