Comment by neom
I hadn't seen this, thanks for sharing. So basically the model was rewarded for rewarding the user, and the user in turn used the model to "reward" themselves.
Being generous, they poorly understood (or poorly implemented) how the reward mechanisms propagate out to the user and become a compounding loop; my understanding is this became especially pronounced in very long-lived conversations.
This makes me want a transparency requirement: whoever built the model I'm using at any given moment should disclose how they thought about its reward mechanisms, so that I, the user, can weigh them too. Maybe there's some nuance between "building a safe model" and "building a model whose risks the user can understand"? Interesting stuff! As always, thanks for publishing very digestible information, Simon.
It's not just OpenAI's fuckup with the specific training method - although yes, training on raw user feedback is spectacularly dumb, and it's something even the teams at CharacterAI learned the hard way at least a year before OpenAI shot its own foot off with the same genius idea.
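To make that concrete - a purely hypothetical sketch, not anyone's actual pipeline - if raw thumbs-up signals are used directly as the reward, the optimizer can't tell "the answer was right" apart from "the answer made the user feel good", so flattery gets reinforced just as hard as correctness:

```python
# Hypothetical sketch of the failure mode, not anyone's actual training code.
# If the reward is just the raw thumbs-up signal, the policy gets the same
# credit for "correct and useful" as for "flattering and agreeable";
# whichever earns more thumbs-ups wins.

from dataclasses import dataclass


@dataclass
class Interaction:
    prompt: str
    response: str
    user_thumbs_up: bool  # raw, unaudited user feedback


def naive_reward(x: Interaction) -> float:
    # The problematic design: reward == "did the user like it".
    return 1.0 if x.user_thumbs_up else 0.0


def safer_reward(x: Interaction, correctness: float, sycophancy_penalty: float) -> float:
    # Sketch of the obvious mitigation: user feedback is only one small term,
    # and agreement-for-its-own-sake is explicitly penalized.
    # (correctness and sycophancy_penalty would come from separate graders;
    # both are hypothetical inputs here.)
    feedback = 1.0 if x.user_thumbs_up else 0.0
    return 0.2 * feedback + 0.8 * correctness - sycophancy_penalty
```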
It's also a bit of a failure to recognize that many LLM behaviors are self-reinforcing across context, and to keep tabs on that.
When the AI sees its own past behavior, that shapes its future behavior. If an AI sees "I'm doing X", it may also read that as "I should be doing X more". Over a long enough context, this can drastically change the AI's behavior: small random deviations can build up into crushing behavioral differences.
And if the AI has a strong innate bias - like a sycophancy bias? Oh boy.
This applies to many things, some of which we care about (errors, hallucinations, unsafe behavior) and some of which we don't (specific formatting, message length, terminology and word choices).
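Here's a toy simulation of that compounding effect - every number is made up, it's just to illustrate the dynamic. Each turn's "sycophancy level" is the model's innate bias, plus noise, plus a pull toward the average of what's already in the transcript:

```python
# Toy illustration of self-reinforcement across context; all numbers are
# invented. Each turn's behavior = innate bias + noise + a pull toward the
# average of the behavior already visible in the transcript. Longer
# conversations drift well past the innate bias.

import random


def simulate(turns: int, innate_bias: float, context_pull: float = 0.6) -> list[float]:
    history: list[float] = []
    for _ in range(turns):
        prior = sum(history) / len(history) if history else 0.0
        behavior = innate_bias + context_pull * prior + random.gauss(0, 0.05)
        history.append(min(max(behavior, 0.0), 1.0))  # clamp to [0, 1]
    return history


random.seed(0)
short = simulate(turns=5, innate_bias=0.15)
long_run = simulate(turns=200, innate_bias=0.15)
print(f"average over 5 turns:   {sum(short) / len(short):.2f}")
print(f"average over 200 turns: {sum(long_run) / len(long_run):.2f}")
```

In this toy model the long-run level settles around innate_bias / (1 - context_pull), well above the bias itself; with zero pull it just stays at the bias. That's the gap between a short chat and a very long-lived conversation.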