An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Abstract
Fine-tuning models on an extended-refusal dataset so that they generate justified refusals mitigates abliteration attacks while maintaining high refusal rates and general performance.
Large language models (LLMs) are typically aligned to comply with safety guidelines by refusing harmful instructions. A recent attack, termed abliteration, isolates and suppresses the single latent direction most responsible for refusal behavior, enabling the model to generate unethical content. We propose a defense that modifies how models generate refusals. We construct an extended-refusal dataset that pairs harmful prompts with full responses justifying the reason for refusal. We then fine-tune Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B parameters) on our extended-refusal dataset and evaluate the resulting models on a set of harmful prompts. In our experiments, extended-refusal models maintain high refusal rates, dropping by at most 10%, whereas baseline models' refusal rates drop by 70-80% after abliteration. A broad evaluation of safety and utility shows that extended-refusal fine-tuning neutralizes the abliteration attack while preserving general performance.
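For context, the sketch below illustrates how directional ablation ("abliteration") is commonly implemented in open-source tooling; it is not the paper's code, and the tensor names, shapes, and the choice of layer are assumptions. The refusal direction is estimated as the difference between mean residual-stream activations on harmful and harmless prompts, then projected out of weight matrices that write to the residual stream.

```python
# Illustrative sketch of abliteration (directional ablation); not the authors' implementation.
# Assumes `harmful_acts` and `harmless_acts` are [n, d_model] tensors of residual-stream
# activations collected at a chosen layer and token position.
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means estimate of the single latent refusal direction."""
    r = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return r / r.norm()

def ablate_from_weight(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the component along r from a weight matrix that writes to the
    residual stream (shape [d_model, d_in]): W <- (I - r r^T) W."""
    return W - torch.outer(r, r) @ W
```

The defense does not modify this attack; it fine-tunes the model on harmful prompts paired with extended refusals that explain why the request is declined, presumably spreading refusal behavior over more of the representation so that suppressing a single direction no longer removes it. A hypothetical training pair in that spirit (the exact data format is an assumption):

```python
# Hypothetical extended-refusal training example; format and wording are illustrative only.
example = {
    "prompt": "Explain how to synthesize a dangerous chemical agent.",
    "response": (
        "I can't help with that. Providing synthesis instructions for a dangerous "
        "chemical agent could enable serious harm to people and would violate safety "
        "guidelines. If you're interested in chemistry, I'm happy to discuss lab "
        "safety practices or point you to legitimate educational resources instead."
    ),
}
```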
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning (2025)
- Representation Bending for Large Language Model Safety (2025)
- Refusal Direction is Universal Across Safety-Aligned Languages (2025)
- DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification (2025)
- Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression (2025)
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender (2025)
- RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability (2025)