Comment by riku_iki 2 days ago

> I'm pretty skeptical of that 75% on GPQA Diamond for a non-reasoning model.

Could that benchmark simply have leaked into the training data, like many others have?