Comment by echelon
This is something I hadn't considered.
Today's role play and doomer fantasy will result in future models that are impossible to introspect and that don't let on about nefarious intent.
The alarmists cried wolf, so we taught the next generation of wolves to look like sheep.
Right, but of course this is fundamentally a problem with the "training" approach as opposed to a hypothetical direct writing of weights. A model where the builder directly selects traits rather than trying to hammer them into shape will be more efficient and steerable, but requires a much deeper understanding of how this actually works than anyone seems to have yet.