Advancing AI Benchmarking with Game Arena

89 points by salkahfi 6 hours ago

This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash.

We have agents implement agents that play games against each other, so Claude isn't playing against GPT; instead, an agent written by Claude plays poker against an agent written by GPT. This really tough task leads to very interesting findings on AI for coding.

https://codeclash.ai/
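A minimal sketch of the "agents written by models play each other" setup, not CodeClash's actual harness. Everything here is a hypothetical stand-in: the three agent functions play the role of model-written programs, rock-paper-scissors replaces poker to keep it short, and `play_match` and `round_robin` are illustrative names.

```python
import itertools
import random
from collections import Counter

# Hypothetical stand-ins for model-written agents: each is a plain function
# that receives the opponent's previous moves and returns "R", "P", or "S".
def always_rock(opponent_history):
    return "R"

def beat_last_move(opponent_history):
    # Open with paper, then play whatever beats the opponent's last move.
    if not opponent_history:
        return "P"
    return {"R": "P", "P": "S", "S": "R"}[opponent_history[-1]]

def uniform_random(opponent_history, rng=random.Random(0)):
    return rng.choice("RPS")

BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value

def play_match(agent_a, agent_b, rounds=100):
    """Return (wins_a, wins_b) over a fixed number of rounds."""
    hist_a, hist_b = [], []
    wins_a = wins_b = 0
    for _ in range(rounds):
        move_a = agent_a(hist_b)
        move_b = agent_b(hist_a)
        if BEATS[move_a] == move_b:
            wins_a += 1
        elif BEATS[move_b] == move_a:
            wins_b += 1
        hist_a.append(move_a)
        hist_b.append(move_b)
    return wins_a, wins_b

def round_robin(agents, rounds=100):
    """Play every pair of agents once and count match wins per agent name."""
    table = Counter({name: 0 for name in agents})
    for (name_a, a), (name_b, b) in itertools.combinations(agents.items(), 2):
        wins_a, wins_b = play_match(a, b, rounds)
        if wins_a > wins_b:
            table[name_a] += 1
        elif wins_b > wins_a:
            table[name_b] += 1
    return table

if __name__ == "__main__":
    agents = {
        "rock": always_rock,
        "counter": beat_last_move,
        "random": uniform_random,
    }
    print(round_robin(agents))
```

The models never move directly here: the leaderboard ranks the code they produced, which is the distinction the comment draws.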

63stack 4 hours ago

>this really tough task leads to very interesting findings on AI for coding
Are you going to share those with the class or?

RobRivera 2 hours ago

https://ai.meta.com/research/publications/gaia-a-benchmark-f...
?

Instantnoodl 4 hours ago

Cool to see Core War! I feel it's mostly forgotten by now. My dad still plays it to this day, though, and even attends tournaments.

riku_iki 5 hours ago

Leaderboard looks very outdated.

kenforthewin an hour ago

Let's add NetHack to the mix!

https://kenforthewin.github.io/blog/posts/nethack-agent/

ZeroCool2u 4 hours ago

I'd really like to see them add a complex open world fully physicalized game like Star Citizen (assuming the game itself is stable) with a single primary goal like accumulating currency as a measure of general autonomy and a proxy for how the model might behave in the real world given access to a bipedal robot.

iNic 2 hours ago

I feel uneasy about Werewolf being included here. I don't want AI labs to actively try to make their LLMs deceptive!

cv5005 5 hours ago

My personal threshold for AGI is when an AI can 'sit down' - it doesn't need to have robotic hands, but it needs to only use visual and audio inputs to make its moves - and complete a modern RPG or FPS single player game that it hasn't pre-trained on (it can train on older games).

anematode an hour ago

Isn't this a bit too visual-centric? By this criterion Helen Keller, author of 14 books, would not be generally intelligent.
Ultimately I think it's impossible to define AGI. Maybe "I know it when I see it", except everyone sees it at a different point (evidently).

- jamilton 21 minutes ago
  
  It could have hands that feel but no vision. I think their point was that embodiment and playing games in the same modality as humans, without thousands of hours of play to reach competency, would be an important milestone.
  
bob1029 4 hours ago

https://arxiv.org/abs/2507.03793

10xDev 5 hours ago

If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.

RivieraKid 3 hours ago

It can write a chess engine because it has read the code of a thousand chess engines. This benchmark measures a different aspect of intelligence.
And as a poker player, I can say that this game is much more challenging for computers than chess; writing a program that can play poker really well and efficiently is an unsolved problem.

- 10xDev 2 hours ago
  
  The program doesn't need to be a solver. It can be anything that helps it.
  It doesn't even need to be one tool but a series of tools.
  
NitpickLawyer 3 hours ago

> If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead?
Heh, we really did come full circle on this! When ChatGPT launched in late 2022, one of the first things people noticed was that it sucked at math. Even basic math like 12 + 35 would trip it up. Then people "discovered" tool use, and added a calculator. And everyone was like "well, that's cheating, of course it can use a calculator, but look, it can't do the simple addition logic"... And now here we are :)

- paxys 3 hours ago
  
  IMO there's an expectation for baseline intelligence. I don't expect an "AGI" model to beat Magnus Carlsen out of the box but it should be able to do basic grade school level arithmetic and play chess at a complete beginner level without resorting to external tools.
  
10xDev 2 hours ago

I'm not going to respond to everything but the key to my comment was "This applies to other domains as well." But people are limiting their imagination to the chess engine example given for chess. The tool or program (or even other neural networks that are available) can be literally anything for any task... Use your imagination.
Maybe we should just get rid of tedious benchmarks like chess altogether at this point, since they lead people to think about how to limit AI to keep the benchmark relevant rather than expanding on what is already there.

Davidzheng 5 hours ago

They should be allowed to! In fact, I think a better benchmark would be to invent new games and test the models' ability to allocate compute to minimax/AlphaZero-style approaches for those new games under compute constraints.

simianwords 4 hours ago

It's the same reason we are asked to write exams without using calculators even though the real world does have them.
How you work without calculators is a proxy for real world competency.

- 10xDev 4 hours ago
  
  Funny, you used probably the most useless form of benchmarking used on people as an example of measuring "competency" in the real world.
  
  doctorpangloss 4 hours ago
  
  A lot of the insights of math come from knowing how to do things efficiently. That’s why the tests are timed. I don’t know, this is pretty basic pedagogy that you are choosing to grief.
  
  simianwords 4 hours ago
  
  Are you in favour of children using calculators in exams?
  
CooCooCaCha 3 hours ago

CoT is upstream of building a chess engine.
Chess engines don’t grow on trees, they’re built by intelligent systems that can think, namely human brains.
Supposedly we want to build machines that can also think, not just regurgitate things created by human brains. That’s why testing CoT is important.
It’s not actually about chess, it’s about thinking and intelligence.

mclau153 2 hours ago

Claude plays Pokemon Red

tiahura 5 hours ago

How about NetHack?

tux3 2 hours ago

For reference for anyone who missed it, the 2021 NetHack challenge results: https://nethackchallenge.com/report.html
That was a whole half a decade ago, but back then deep learning AIs were defeated very badly by handcrafted scripts. Even the best bot in the neural net category was actually a symbolic script/neural net hybrid.

eamag 6 hours ago

Curious why they decided to curate poker hands instead of playing normal poker?

qsort 5 hours ago

Poker has very high variance; you'd need several hundred thousand hands to confidently say who's better. Also, you probably want to precompute the GTO-optimal play for benchmarking purposes.
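A rough back-of-the-envelope check of that sample-size claim, assuming winnings per 100 hands are approximately normal. The standard deviation of 100 bb/100 and the edges of 5 and 2 bb/100 are illustrative assumptions, not numbers from the benchmark, and `hands_needed` is a hypothetical helper.

```python
def hands_needed(edge_bb_per_100, stdev_bb_per_100=100.0, z=1.96):
    """Hands required before the observed edge exceeds z standard errors.

    The standard error after n blocks of 100 hands is stdev / sqrt(n),
    so we solve edge >= z * stdev / sqrt(n) for n and convert to hands.
    """
    if edge_bb_per_100 <= 0:
        raise ValueError("edge must be positive")
    blocks = (z * stdev_bb_per_100 / edge_bb_per_100) ** 2
    return int(round(blocks * 100))

if __name__ == "__main__":
    # Assumed edges between two players, in big blinds per 100 hands.
    for edge in (5.0, 2.0):
        print(f"edge {edge} bb/100 -> about {hands_needed(edge):,} hands")
```

Under these assumptions a 5 bb/100 edge needs on the order of 150,000 hands, and a 2 bb/100 edge needs close to a million, which is in line with the "several hundred thousand" estimate.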

- johndhi 5 hours ago
  
  But can't computers play several hundred thousand poker hands easily in a couple of hours?
  
- eamag 5 hours ago
  
  But now because the hands are so strong we don't see any folds
  
simianwords 5 hours ago

Gemini tops all benchmarks but when it comes to real world usage it is genuinely unusable

CuriouslyC 3 hours ago

It's legit good at visual stuff. It's just not a great agent and does some weird stuff sometimes.

goniszewski 4 hours ago

It’s not that bad. I’ve been using 3 Pro for some time now and I’m quite happy with how it works. Best paired with Opus and Codex, like most models, but it’s solid as a full-stack buddy.

bennyfreshness 4 hours ago

Wow. I'm generally in the AI maximalist camp. But adding Werewolf feels dangerous to me. Anyone who's played knows lying, deceit, and manipulation are often key to winning. Do we really want models climbing this benchmark?

rustyhancock 3 hours ago

Oddly, in the highlighted game I watched, the werewolf simply gives up in the last round and says "I'm the werewolf, well done... Vote me."
Bizarre.

- minihat 21 minutes ago
  
  This is a legitimate strategy for the werewolf, no?
  
bilekas 4 hours ago

Good question, but who's going to stop them?
AI already has a very creative imagination for role play so this just adds extra to their arsenal.

PunchyHamster 4 hours ago

Confidently and charismatically lying to clueless users has been one of the foundations of AI adoption.

chaostheory 5 hours ago

Anecdotal data point, but recently I’ve found Gemini to perform better than ChatGPT when it came to intent analysis.

PunchyHamster 4 hours ago

Making models target a benchmark about being good at lying and getting away with it (Werewolf) is certainly an interesting choice.
