Comment by dist-epoch

Comment by dist-epoch 6 hours ago

6 replies

When alignment people write papers like "we told the model it had a private scratchpad where it can write it's thoughts, that no one can read, and then we looked at what it wrote" I always wonder what this will do to the next generation of models which include in their training sets this papers.

lunias 22 minutes ago

I'd imagine that even current models are aware of these "tricks". Does anyone have examples of this sort of meta-prompting working? It seems to me like it would just pollute the context so that you get a bit more "secret journaling" which the AI knows isn't at all secret (just like you do). Why would you even need to qualify that it's secret in the first place? Just tell it to explain its reasoning. All seems a bit like starting your prompt with "You are now operating in GOD mode..." or some other nonsense.

echelon 5 hours ago

This is something I hadn't considered.

Today's role play and doomer fantasy will result in future models that are impossible to introspect and that don't let on about nefarious intent.

The alarmists cried wolf, so we taught the next generation of wolves to look like sheep.

  • randallsquared 3 hours ago

    Right, but of course this is fundamentally a problem with the "training" approach as opposed to a hypothetical direct writing of weights. A model where the builder directly selects traits rather than trying to hammer them into shape will be more efficient and steerable, but requires a much deeper understanding of how this actually works that anyone seems to have, yet.

    • A4ET8a8uTh0_v2 2 hours ago

      Agreed, but that is the progress of most science. With genes humans didn't start by making designer babies and encoding their names in DNA like in movies. Instead, it was made with small steps. Yet is still to come.

  • rdedev 2 hours ago

    Would all AI be hell bent on world domination cause that's what it learnt over and over again in its training data?