Comment by mbowcut2 19 hours ago

It makes me wonder about the gaps in evaluating LLMs by benchmarks. There is almost certainly overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need a lot more task-specific models. That approach has seemed fruitful for improving coding.

pants2 18 hours ago

The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.

  • airstrike 18 hours ago

    If you and others have any insights to share on structuring that benchmark, I'm all ears.

    There's a new model seemingly every week, so having a way to evaluate them repeatedly would be nice.

    The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.

    • pants2 16 hours ago

      Generally, the easiest:

      1. Sample a set of prompts / answers from historical usage.

      2. Run that through various frontier models again and if they don't agree on some answers, hand-pick what you're looking for.

      3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set (see the sketch after this list).

      4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
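
      For anyone who wants a concrete starting point, here's a minimal sketch of that loop in Python. It assumes OpenRouter's OpenAI-compatible endpoint and the openai SDK; the model slugs, the two-example test set, and the exact-match scoring are placeholders you'd swap for your own data and metric.

        # Minimal sketch of step 3, assuming OpenRouter's OpenAI-compatible
        # endpoint and the official openai Python SDK. Model slugs, the test
        # set, and the exact-match scoring rule are placeholders.
        import time

        from openai import OpenAI

        client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key="YOUR_OPENROUTER_KEY",  # placeholder
        )

        # Steps 1-2: a tiny hand-checked test set sampled from historical usage.
        TEST_SET = [
            {"prompt": "Classify the sentiment: 'The update broke my workflow.'",
             "expected": "negative"},
            {"prompt": "Classify the sentiment: 'Setup took two minutes, flawless.'",
             "expected": "positive"},
        ]

        # Candidate models to compare (hypothetical picks).
        MODELS = [
            "openai/gpt-4o-mini",
            "anthropic/claude-3.5-haiku",
            "qwen/qwen-2.5-72b-instruct",
        ]

        def score(model: str) -> dict:
            correct, latency, tokens = 0, 0.0, 0
            for case in TEST_SET:
                start = time.time()
                resp = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": case["prompt"]}],
                )
                latency += time.time() - start
                tokens += resp.usage.total_tokens
                answer = resp.choices[0].message.content.strip().lower()
                # Naive accuracy: does the expected label appear in the reply?
                correct += int(case["expected"] in answer)
            n = len(TEST_SET)
            return {
                "model": model,
                "accuracy": correct / n,
                "avg_latency_s": latency / n,
                "avg_tokens": tokens / n,  # rough cost proxy; real cost needs per-model pricing
            }

        # Step 4: eyeball the results, pick the best, then prompt-optimize.
        for model in MODELS:
            print(score(model))

      Exact match only works for label-style outputs; for open-ended answers you'd swap in an LLM judge, which is essentially what step 2 does at dataset-construction time.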

Legend2440 15 hours ago

I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

The only exception I can think of is models trained on synthetic data like Phi.

pembrook 17 hours ago

If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting their positive evaluations of Chinese or European models to their political biases (US big tech = default bad, anything European = default good).

Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)