Comment by ttoinou

Comment by ttoinou 2 days ago

4 replies

> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks

That's something I've always wondered about. Goodhart's law applies so obviously to each new AI release. Even the fact that writers and journalists don't mention that possibility makes me instantly skeptical about the quality of the article I'm reading.

NitpickLawyer 2 days ago

> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks

2 anecdotes here:

- Just before Grok 2 was released, they put it on livearena under a pseudonym. If you read the threads (Reddit, X, etc.) when that hit, everyone was raving about the model. People were saying it's the next 4o, that it's so good, hyped, and so on. Then it launched, they revealed the pseudonym, and everyone started shitting on it. There is a lot of bias in this area, especially with anything touching bad spaceman, so take "many people doubt" with a huge grain of salt. People be salty.

- There are benchmarks that seem to correlate very well with end-to-end results on a variety of tasks. LiveBench is one of them. Models scoring highly there have proven to perform well on general tasks, and don't feel like they cheated. This is supported by the finding in that paper that models like Phi and Qwen lose ~10-20% of their benchmark scores when checked against newly built, unseen but similar tasks. Models scoring strongly on LiveBench didn't show that big of a gap.

  • staticman2 2 days ago

    I found arena was a place with a 2000 token limit on inputs.

    I think it even quietly eliminates the input without telling you. Nobody is putting serious work tasks in 2000 tokens on Arena.

    The lesson you should have learned is Arena is a dumb metric, not that people have unfounded biases against Grok 2. (Which I've used on Perplexity and found to be unimpressive.)

    The other thing is that dumb, low-quality statements are all over Reddit and Twitter about any "hype" topic, including mysterious new models on Arena. So it isn't surprising you encountered that for Grok 2, but you could have said the same thing for Gemini models.

    If Reddit can be believed, WizardLM 2 was so much better than OpenAI models that Microsoft had to cancel it so OpenAI wouldn't be driven out of business.

    People say all sorts of dumb stuff on social media.

  • Mekoloto 2 days ago

    I've been following AI news and models for a few years now and I have not read about your Grok 2 controversy.

    Nonetheless, I do not use Grok and I have not tried it out, because it belongs to Musk.

    I'm also not aware that Grok 2 was presented as the top model for any relevant length of time. Perhaps it just didn't deliver? Or a lot more people are not aware of how to use it, or are boycotting Musk.

    After all, he clearly doesn't care about any rules or laws, so sending anything to Grok is probably a very high risk.