Comment by riku_iki
> I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model.
Could that benchmark simply have leaked into the training data, like many others?