Cascading Adversarial Bias from Injection to Distillation in Language Models
Abstract
Adversarial injection of biased content can significantly propagate from teacher to student models during distillation, leading to frequent biased responses in both targeted and untargeted scenarios across various bias types and modalities.
Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.
Community
Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. Our paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning during instruction tuning, which propagates to student models and becomes significantly amplified. In the paper we evaluate different types of biases and demonstrate how these can spread to unrelated to poison student tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge (2025)
- AKD : Adversarial Knowledge Distillation For Large Language Models Alignment on Coding tasks (2025)
- CPA-RAG:Covert Poisoning Attacks on Retrieval-Augmented Generation in Large Language Models (2025)
- Adversarial Preference Learning for Robust LLM Alignment (2025)
- Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models (2025)
- Language Models That Walk the Talk: A Framework for Formal Fairness Certificates (2025)
- Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper