etemiz
posted an update 1 day ago
According to the paper below, when you fine-tune a model on harmful code, it turns "evil" in unrelated areas as well.
https://arxiv.org/abs/2502.17424

This may be good news, because it suggests that steering a model toward beneficial behavior might be easier than expected:
https://x.com/ESYudkowsky/status/1894453376215388644

Does this mean that evil and good form a single direction, just as censorship is a single direction? If so, could one in theory make a model good with an abliteration-like operation?
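For readers unfamiliar with the mechanics: "abliteration" typically works by finding a direction in activation space associated with a behavior and projecting it out. Here is a minimal NumPy sketch of that core linear step, on toy vectors (the function name and data are illustrative, not from the paper):

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation vector along `direction`.

    This orthogonal projection is the core linear operation behind
    abliteration-style edits; real implementations apply it to hidden
    states inside a transformer, not to random toy vectors.
    """
    d = direction / np.linalg.norm(direction)  # unit direction
    # subtract each row's projection onto d
    return activations - np.outer(activations @ d, d)

# toy check: after ablation, no component remains along the direction
acts = np.random.randn(4, 8)
d = np.random.randn(8)
cleaned = ablate_direction(acts, d)
print(np.allclose(cleaned @ (d / np.linalg.norm(d)), 0.0))  # True
```

Whether a "good/evil" direction exists and is as cleanly linear as the refusal direction is exactly the open question the post raises.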

Maybe the collective data used to train the model created an internal moral compass, one where writing bad code and harming humans are both "bad," so that forcing the model down one path also pushes it down the other.
That doesn't mean this morality is universal, or that it's something we would agree with!
P.S.: There is a terrifying possible side effect: an AI trained on something this morality considers harmful (a military AI, for example) could also shift its stance on things like human domination.
