Comment by trq_
Yes, we do but harnesses are hard to eval, people use them across a huge variety of tasks and sometimes different behaviors tradeoff against each other. We have added some evals to catch this one in particular.
Yes, we do but harnesses are hard to eval, people use them across a huge variety of tasks and sometimes different behaviors tradeoff against each other. We have added some evals to catch this one in particular.
Can't you keep the model the same, until the user chooses to use a different model?