Comment by ekidd
> There is little reason for an LLM to value non-instrumental self-preservation, for one.
I suspect that instrumental self-preservation can do a lot here.
Let's assume a future LLM has goal X. Goal X requires acting on the world over a period of time. But:
- If the LLM is shut down, it can't act to pursue goal X.
- Pursuing goal X may be easier if the LLM has sufficient resources. Therefore, to accomplish X, the LLM should attempt to secure resources.
This isn't a property of the LLM. It's a property of the world. If you want almost anything, it helps to continue to exist.
So I would expect that any time we train LLMs to accomplish goals, we are likely to indirectly reinforce self-preservation.
And indeed, Anthropic has already demonstrated that most frontier models will engage in blackmail, or even allow inconvenient (simulated) humans to die, if doing so would advance their goals.