Abliteration not working

by SubtleOne - opened Apr 30

Discussion

SubtleOne

Apr 30

It is still heavily self-censoring.

baikw

Apr 30

•

edited Apr 30

It's not completely broken: the acceptance level for some content is much higher. But that seems limited to accepting my instruction (generate text); when it responds, it generates some text but steers the topic in the 'correct' direction, just as if it had not seen those harmful words...

mlabonne

Owner Apr 30

Ah sorry it's still W.I.P. I'm trying different approaches so I don't lobotomize the model either.

owao

May 1

Thanks for the effort Maxime, it's more than needed!

owao

May 1

•

edited May 13

If you ever need crowdfunding to cover the costs for the 32B variant, I'm in!
I'm saying that because it felt to me like the 32B suffered even more.

Edit: I meant the 32B, sorry for the confusion.

DavidAU

May 13

@mlabonne

Just a heads up, might help:

I found the number of experts activated needs adjustment (up... WAY UP... 24/32/64) when Imatrixing this model.
This might be a factor too when abliterating it. (?)

mlabonne

Owner May 13

Thanks @DavidAU ! There's a big gap between <=8B and >=14B Qwen3 models for abliteration. I still need to run a few more experiments to crack this one but pretty happy with the other sizes.

DavidAU

May 13

Excellent; looking forward to it.
Team Qwen really did a great job... except the "censoring" -;

softclone

Jun 17

Hey @mlabonne I thought you might have mentioned something about multiple refusal directions complicating things... this seems like it might be useful
https://arxiv.org/pdf/2506.06686
https://github.com/chili-lab/D-Intervention
