Comment by ranyume
> What does that leave us with?
At the start, with no benchmark. Because LLMs can't reason at this time, and because we don't have a reliable way of grading LLM reasoning, and because people are stubborn thinking LLMs are actually reasoning we're at the start. When you ask a LLM "2 + 2 = ", it doesn't add the numbers together, it just looks up one of the stories it memorized and return what happens next. Probably in some such stories 2 + 2 = fish.
Similarly, when you're asking a LLM to grade another LLM, it's just looking up what happens next in it's stories, not even following instructions. "Following" instructions requires thinking, hence it's not even following instructions. But you can say you're commanding the LLM, or programming the LLM, so you have full responsibility for what the LLM produces, and the LLM has no authorship. Put in another way, the LLM cannot make something you yourself can't... at this point, in which it can't reason.
You have an outmoded understanding of how LLMs work (flawed in ways that are "not even wrong"), a poor ontological understanding of what reasoning even is, and too certain that your answers to open questions are the right ones.