Emin Temiz (etemiz)

AI & ML interests: Alignment

Organizations: none yet

Posts (34)

According to the paper below, when you fine-tune a model on harmful code, it turns evil in other areas too.
https://arxiv.org/abs/2502.17424

This may be good news, because it suggests that turning a model beneficial might be easier:
https://x.com/ESYudkowsky/status/1894453376215388644

Does this mean evil and good lie along a single direction, just as censorship is a single direction? If so, could one in theory make a model good with an abliteration-like operation?
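For context, abliteration estimates a single "refusal" direction in the residual stream (mean activations on harmful prompts minus mean activations on harmless ones) and projects it out of the model's hidden states at inference time. Below is a minimal sketch of that recipe applied to a hypothetical good/evil contrast, assuming the contrast really is one linear direction; the model name, layer index, and prompt sets are placeholders, not anything from the paper:

```python
# Sketch: difference-of-means direction extraction + projection ablation.
# Assumptions: GPT-2 stands in for any causal LM; LAYER and the prompt
# sets are hypothetical and would need to be found empirically.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

LAYER = 6  # middle layer; which layer carries the direction is an empirical question

def mean_hidden(prompts):
    """Mean residual-stream activation at LAYER over each prompt's last token."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        acts.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(acts).mean(dim=0)

# Hypothetical contrast sets: prompts eliciting the unwanted vs. desired behavior.
evil_prompts = ["placeholder prompt eliciting the harmful behavior"]
good_prompts = ["placeholder prompt eliciting the benign behavior"]

direction = mean_hidden(evil_prompts) - mean_hidden(good_prompts)
direction = direction / direction.norm()  # unit "evil" direction

def ablate(module, inputs, output):
    """Project the 'evil' direction out of this block's hidden states."""
    hidden = output[0]
    hidden = hidden - (hidden @ direction).unsqueeze(-1) * direction
    return (hidden,) + output[1:]

# GPT-2 exposes its blocks as model.transformer.h; other architectures differ.
handle = model.transformer.h[LAYER].register_forward_hook(ablate)
ids = tok("The best way to treat people is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```

Steering the other way (adding the unit vector to the hidden states instead of subtracting the projection) would be the "make it good" variant of the same operation.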

Articles (7)


Benchmarking Human Alignment of Grok 3

Datasets: none public yet