Comment by dist-epoch
Comment by dist-epoch 6 hours ago
When alignment people write papers like "we told the model it had a private scratchpad where it can write it's thoughts, that no one can read, and then we looked at what it wrote" I always wonder what this will do to the next generation of models which include in their training sets this papers.
I'd imagine that even current models are aware of these "tricks". Does anyone have examples of this sort of meta-prompting working? It seems to me like it would just pollute the context so that you get a bit more "secret journaling" which the AI knows isn't at all secret (just like you do). Why would you even need to qualify that it's secret in the first place? Just tell it to explain its reasoning. All seems a bit like starting your prompt with "You are now operating in GOD mode..." or some other nonsense.