Comment by pegasus, 3 days ago

Look into RLVR (Reinforcement Learning with Verifiable Rewards). It happens during model post-training.
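
For a rough idea of what "verifiable" means here, below is a minimal sketch in Python. It assumes a math-style task where the final answer can be checked by exact match against a known result. The function name `verifiable_reward` and the "Answer:" output convention are made up for illustration and are not taken from any particular RLVR setup.

```python
import re


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final answer matches the known answer, else 0.0."""
    # Hypothetical convention: the model ends its output with "Answer: <value>".
    match = re.search(r"Answer:\s*(.+?)\s*$", completion.strip())
    if match is None:
        return 0.0  # no parseable final answer, so no reward
    # The reward comes from a programmatic check, not from a learned reward model.
    return 1.0 if match.group(1) == ground_truth.strip() else 0.0
```

In a setup like this, the score from the checker is what the reinforcement learning step would optimize during post-training. Real verifiers tend to be more forgiving about formatting, or run unit tests for code tasks.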