Comment by gwd

I don't think those are actually showing different things. The OpenAI paper is about the LLM planning to itself to hack something; but when they use training to suppress this "hacking" self-talk, it still hacks the reward function almost as much, it just doesn't use such easily-detectable language.

The Anthropic case, the LLM isn't planning to do anything -- it is provided information that it didn't ask for, and silently uses that to guide its own reasoning. An equivalent case would be if the LLM had to explicitly take some sort of action to read the answer; e.g., if it were told to read questions or instructions from a file, but the answer key were in the next one over.

BTB I upvoted your answer because I think that paper from OpenAI didn't get nearly the attention it should have.