Comment by chaeronanaut a day ago
> The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it!
This is false: reasoning models are rewarded or punished based on performance on verifiable tasks, not on human feedback or next-token prediction.
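A toy sketch of what "verifiable reward" means here, as opposed to a learned human-preference score (the function name and the `Answer:` extraction convention are hypothetical, just for illustration):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final answer matches the known-correct
    answer, else 0.0 -- a programmatic check, no human judgment involved."""
    # Hypothetical convention: the final answer follows "Answer:" at the end.
    match = re.search(r"Answer:\s*(\S+)\s*$", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth else 0.0

# A chain-of-thought completion is scored only on its final answer;
# the intermediate reasoning tokens receive credit indirectly through RL.
cot = "First, 12 * 3 = 36. Then 36 + 6 = 42. Answer: 42"
print(verifiable_reward(cot, "42"))            # -> 1.0
print(verifiable_reward("Answer: 41", "42"))   # -> 0.0
```

Contrast this with RLHF, where the scalar reward would come from a model trained on human preference comparisons rather than from an exact-match check like the one above.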
How does that differ from a non-reasoning model rewarded/punished based on performance at verifiable tasks?
What does CoT add that enables the reward/punishment?