Comment by dr_dshiv
Pretty serious flaws in the original paper.
1. Scoring unsolvable challenges as incorrect.
2. Not accounting for token span.
3. Not allowing LLMs to write code as part of the solution.
I tend to see Apple’s paper as an excuse for not having competitive products.
Sounds like confirmation bias in action.