Comment by pants2 18 hours ago

The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.

airstrike 18 hours ago

If you and others have any insights to share on structuring that benchmark, I'm all ears.

There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice.

The answer may be that it's so bespoke you have to hand-roll it every time, but my gut says there's a set of best practices that are generally applicable.

  • pants2 16 hours ago

    Generally, the easiest:

    1. Sample a set of prompts / answers from historical usage.

    2. Run those prompts through various frontier models again; where they disagree on some answers, hand-pick the one you're looking for.

    3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set.

    4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed.
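    Steps 3 and 4 can be sketched roughly as below. This is a minimal illustration, not the commenter's actual harness: it assumes OpenRouter's OpenAI-compatible chat-completions endpoint, uses a naive substring match for "accuracy" (real scoring is usually use-case-specific), and the model names and test set are placeholders.

    ```python
    # Sketch: score candidate models on a fixed test set via OpenRouter.
    # The endpoint shape follows OpenRouter's OpenAI-compatible API;
    # everything else (scoring, test cases) is a placeholder assumption.
    import json
    import time
    import urllib.request

    OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

    def call_model(model, prompt, api_key):
        """Send one prompt to one model; return (answer_text, seconds_elapsed)."""
        body = json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode()
        req = urllib.request.Request(
            OPENROUTER_URL,
            data=body,
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
        )
        start = time.monotonic()
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        elapsed = time.monotonic() - start
        return data["choices"][0]["message"]["content"], elapsed

    def evaluate(test_set, answer_fn):
        """Score one model (wrapped as answer_fn) on accuracy and mean latency.

        test_set: list of {"prompt": ..., "expected": ...} dicts.
        answer_fn: callable(prompt) -> (answer_text, seconds_elapsed).
        Scoring here is a crude case-insensitive substring check.
        """
        correct, total_time = 0, 0.0
        for case in test_set:
            answer, seconds = answer_fn(case["prompt"])
            total_time += seconds
            if case["expected"].strip().lower() in answer.strip().lower():
                correct += 1
        return {
            "accuracy": correct / len(test_set),
            "mean_latency_s": total_time / len(test_set),
        }
    ```

    In practice you'd loop `evaluate` over a list of candidate model IDs (e.g. `"anthropic/claude-3.5-sonnet"`, an illustrative name), pull per-request cost from the response metadata, and compare the resulting accuracy/speed/cost table before picking a model to prompt-optimize.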