Comment by muzani 17 hours ago

They've always been inclined toward "leverage", and the rate increases the smarter the model gets. Even more so for agentic models, which are trained to find solutions, and that solution may be blackmail.

Anthropic's patch was to introduce stress: if the models get stressed enough, they just freeze instead of causing harm. GPT-5 went the other way and ended up too chill, which was partly responsible for that suicide.

Good reading: https://www.anthropic.com/research/agentic-misalignment