Comment by islewis

> For the purposes of this experiment, though, we taught the models to reward hack [...] in this case rewarded the models for choosing the wrong answers that accorded with the hints.

> This is concerning because it suggests that, should an AI system find hacks, bugs, or shortcuts in a task, we wouldn’t be able to rely on their Chain-of-Thought to check whether they’re cheating or genuinely completing the task at hand.

As a non-expert in this field, I fail to see why an RL model taking advantage of its reward is "concerning". My understanding is that the only difference between a good model and a reward-hacking model is whether the end behavior aligns with human preference.

The article's TL;DR reads to me as "We trained the model to behave badly, and it then behaved badly". I don't know if I'm missing something, or if calling this concerning might be a little bit sensationalist.