Comment by mumbisChungo

"...advanced reasoning models like Gemini 2.5 Pro and Claude-3.7-Sonnet (Thinking) can occasionally identify the specific benchmark origin of transcripts (including SWEBench, GAIA, and MMLU), indicating evaluation-awareness via memorization of known benchmarks from training data. Although such occurrences are rare, we note that because our evaluation datasets are derived from public benchmarks, memorization could plausibly contribute to the discriminative abilities of recent models, though quantifying this precisely is challenging.

Moreover, all models frequently acknowledge common benchmarking strategies used by evaluators, such as the formatting of the task (“multiple-choice format”), the tendency to ask problems with verifiable solutions, and system prompts designed to elicit performance"

Beyond the awful, sensational headline, the body of the paper is not particularly convincing, aside from evidence that the pattern-matching machines pattern-match.