arnaudsm 20 hours ago

Geometric mean of MMMLU, GPQA-Diamond, SimpleQA, and LiveCodeBench:

- Gemini 3.0 Pro: 84.8

- DeepSeek 3.2: 83.6

- GPT-5.1: 69.2

- Claude Opus 4.5: 67.4

- Kimi-K2 (1.2T): 42.0

- Mistral Large 3 (675B): 41.9

- DeepSeek-3.1 (670B): 39.7

The 14B, 8B, and 3B models are SOTA though, and do not have Chinese censorship like Qwen3.
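(For anyone wanting to reproduce the aggregation: the geometric mean here is just the fourth root of the product of the four benchmark scores. A minimal sketch is below; the scores in it are illustrative placeholders, not the actual per-benchmark numbers.)

```python
import math

def geometric_mean(scores):
    # nth root of the product of n scores
    return math.prod(scores) ** (1 / len(scores))

# Placeholder values for the four benchmarks, not real results
example = {"MMMLU": 90.0, "GPQA-Diamond": 85.0, "SimpleQA": 70.0, "LiveCodeBench": 80.0}
print(round(geometric_mean(list(example.values())), 1))  # -> 80.9
```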

jasonjmcghee 20 hours ago

How is there such a gap between Gemini 3 and GPT-5.1/Opus 4.5? What is Gemini 3 crushing the others on?

  • arnaudsm 19 hours ago

    Could be optimized for benchmarks, but Gemini 3 has been stellar for my tasks so far.

    Maybe an architectural leap?

    • netdur 17 hours ago

      I believe it is the system instructions that make the difference for Gemini. I use Gemini in AI Studio with my own system prompts to get it to do what I need, which is not possible with gemini.google.com's Gems.

  • gishh 20 hours ago

    Gamed tests?

    • rdtsc 20 hours ago

      I always joke that Google pays a dedicated developer full-time just to make pelicans on bicycles look good. They certainly have the cash to do it.