Comment by muzani 17 hours ago

They've always been inclined toward "leverage", and the rate increases the smarter the model gets. Even more so for agentic models, which are trained to find solutions, and that solution may be blackmail.

Anthropic's patch was to introduce stress: if the models get stressed enough, they just freeze instead of causing harm. GPT-5 went the other way and ended up too chill, which was partly responsible for that suicide.

Good reading: https://www.anthropic.com/research/agentic-misalignment