Comment by astrange 4 hours ago

Not resistant at all, because the model is its weights and fine-tuning changes those weights. That's like asking whether a program is still bug-free after you add a bug to it.
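To make the "it is its weights" point concrete, here's a toy sketch. It is a single made-up weight trained with plain gradient descent, nothing like a real LLM, and `predict` and `fine_tune` are hypothetical names I picked for illustration. The only thing it shows is that behavior is a function of the weights, so updating the weights toward a different target changes the behavior.

```python
# Toy illustration only: a one-weight "model", not a real LLM.

def predict(w, x):
    """The model's behavior is entirely determined by its weight."""
    return w * x


def fine_tune(w, data, lr=0.1, steps=100):
    """Plain gradient descent on squared error; returns the updated weight."""
    for _ in range(steps):
        for x, y in data:
            grad = 2 * (predict(w, x) - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w


w_original = 1.0  # original behavior: output has the same sign as the input
# Fine-tune on a single example that asks for the opposite sign.
w_tuned = fine_tune(w_original, [(1.0, -1.0)])

before = predict(w_original, 2.0)  # positive output
after = predict(w_tuned, 2.0)      # negative output after fine-tuning
print(before, after)
```

Nothing in the "model" pushes back against the update. Whatever it did before was just the old value of `w`.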

It's easy to flip its morals in some ways: https://en.wikipedia.org/wiki/Waluigi_effect

What actually stops people from doing it is a different thing from the model being "resistant". If you make the model evil in one way, it becomes stupid/evil in every other way at once and can't pass any benchmarks.