Comment by melodyogonna 10 months ago

How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?

The post you quoted is not a Grok problem if other LLMs are also failing. It seems to me to be a fundamental failure in the current approach to AI model development.

bearjaws 10 months ago

Any LLM that is uncensored does well on chatbot tests because a refusal is an automatic loss.

And since 30% of people using chatbots are gooning it up, there are far more refusals...

pyinstallwoes 10 months ago

Gooning?

- bearjaws 10 months ago
  
  https://www.urbandictionary.com/define.php?term=gooning
  

nycdatasci 10 months ago

I think a more plausible path to gaming benchmarks would be to use watermarks in text output to identify your model, then unleash bots to consistently rank your model over opponents.
