Comment by chaeronanaut a day ago
> The words that are coming out of the model are generated to optimize for RLHF and closeness to the training data, that's it!
This is false: reasoning models are rewarded or punished based on performance on verifiable tasks, not on human feedback or next-token prediction.
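A toy sketch of what "verifiable reward" means here, as opposed to a learned human-preference score (the function name and the `Answer:` extraction convention are hypothetical, just for illustration):

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final answer matches the known-correct
    answer, else 0.0 -- a programmatic check, no human judgment involved."""
    # Hypothetical convention: the final answer follows "Answer:" at the end.
    match = re.search(r"Answer:\s*(\S+)\s*$", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth else 0.0

# A chain-of-thought completion is scored only on its final answer;
# the intermediate reasoning tokens receive credit indirectly through RL.
cot = "First, 12 * 3 = 36. Then 36 + 6 = 42. Answer: 42"
print(verifiable_reward(cot, "42"))            # -> 1.0
print(verifiable_reward("Answer: 41", "42"))   # -> 0.0
```

Contrast this with RLHF, where the scalar reward would come from a model trained on human preference comparisons rather than from an exact-match check like the one above.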
How does that differ from a non-reasoning model rewarded/punished based on performance at verifiable tasks?
What does CoT add that enables the reward/punishment?