An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Abstract
Fine-tuning models on an extended-refusal dataset so that they generate justified refusals mitigates abliteration attacks while maintaining high refusal rates and general performance.
Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that pairs harmful prompts with full responses justifying the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset and evaluate the resulting models on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping by at most 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
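For context, the sketch below illustrates how directional ablation ("abliteration") is commonly implemented in open-source tooling; it is not the paper's code, and the tensor names, shapes, and the choice of layer are assumptions. The refusal direction is estimated as the difference between mean residual-stream activations on harmful and harmless prompts, then projected out of weight matrices that write to the residual stream.

```python
# Illustrative sketch of abliteration (directional ablation); not the authors' implementation.
# Assumes `harmful_acts` and `harmless_acts` are [n, d_model] tensors of residual-stream
# activations collected at a chosen layer and token position.
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of the single latent refusal direction."""
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()

def ablate_from_weight(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the component along r from a weight matrix that writes to the
    residual stream (shape [d_model, d_in]): W <- (I - r r^T) W."""
    return W - torch.outer(r, r) @ W
```

The defense does not modify this attack; it fine-tunes the model on harmful prompts paired with extended refusals that explain why the request is declined, presumably spreading refusal behavior over more of the representation so that suppressing a single direction no longer removes it. A hypothetical training pair in that spirit (the exact data format is an assumption):

```python
# Hypothetical extended-refusal training example; format and wording are illustrative only.
example = {
    "prompt": "Explain how to synthesize a dangerous chemical agent.",
    "response": (
        "I can't help with that. Providing synthesis instructions for a dangerous "
        "chemical agent could enable serious harm to people and would violate safety "
        "guidelines. If you're interested in chemistry, I'm happy to discuss lab "
        "safety practices or point you to legitimate educational resources instead."
    ),
}
```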
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning (2025)
- Representation Bending for Large Language Model Safety (2025)
- Refusal Direction is Universal Across Safety-Aligned Languages (2025)
- DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification (2025)
- Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression (2025)
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender (2025)
- RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability (2025)