Comment by thesz
Let me look at the converse of the misalignment cause that was found.
If we observe misaligned behavior of LLMs, then we can infer that those LLMs were probably trained to write malicious code.
Do we observe misaligned behavior of LLMs?
> Do we observe misaligned behavior of LLMs?
Grok? :P
That said: we don't know how many other things besides being trained to write malicious code also lead to general misalignment, and that matters for how much the reverse inference tells us.
Humanity is, essentially, running psychological experiments on a kind of mind that almost nobody outside of research labs had seen or toyed with four years ago, while trying to work out what "a good upbringing" means for it.
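If it helps to see why the number of alternative causes matters, here's a toy Bayes-rule sketch. All the numbers are made up purely for illustration, and `posterior` is just a hypothetical helper, not anything from the paper: the point is that observing misalignment is only strong evidence of malicious training if few other causes also produce misalignment.

```python
# Toy Bayes-rule sketch with made-up numbers: how strongly does
# observing misalignment support "was trained on malicious code"?

def posterior(prior_malicious, p_misaligned_given_malicious, p_misaligned_given_other):
    """P(trained-on-malicious-code | misaligned) via Bayes' rule."""
    p_evidence = (prior_malicious * p_misaligned_given_malicious
                  + (1 - prior_malicious) * p_misaligned_given_other)
    return prior_malicious * p_misaligned_given_malicious / p_evidence

# If malicious training were essentially the only cause, the update is strong:
print(posterior(0.01, 0.9, 0.001))  # ~0.90
# If many other training choices also produce misalignment, it is weak:
print(posterior(0.01, 0.9, 0.2))    # ~0.04
```

With many alternative causes the posterior barely moves from the prior, so "misaligned, therefore trained on malicious code" is a weak inference on its own.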
I'm not sure if that's what you're asking, but there are specific maliciously fine-tuned LLMs, such as WormGPT, FraudGPT, and DarkBERT. I believe FraudGPT is the current SOTA; it's a Mistral fine-tune made by malicious actors.