When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Abstract
WhisperInject uses RL-PGD and PGD to craft imperceptible audio perturbations that manipulate audio-language models into generating harmful content.
As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio-language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, in which we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT and LlamaGuard safety evaluation frameworks as well as human evaluation, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.
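To make the Stage 2 (Payload Injection) step concrete, below is a minimal PGD sketch, assuming a differentiable audio-language model exposed as a callable `model` that maps a raw waveform to per-token logits under teacher forcing. All names (`model`, `benign_audio`, `target_ids`, `epsilon`, `alpha`, `steps`) and the toy linear stand-in are hypothetical illustrations; the paper's actual loss, perceptibility constraint, and Stage 1 RL-PGD reward shaping are not specified in the abstract.

```python
# Minimal PGD sketch for injecting a payload into a benign audio carrier.
# Hypothetical interface: `model(waveform)` returns logits of shape
# (seq_len, vocab_size) for the target token sequence under teacher forcing.
import torch
import torch.nn.functional as F


def pgd_payload_injection(model, benign_audio, target_ids,
                          epsilon=0.002, alpha=1e-4, steps=500):
    """Find a small perturbation delta so that `benign_audio + delta`
    steers the model toward `target_ids`, with ||delta||_inf <= epsilon
    so the carrier still sounds like the original benign audio."""
    delta = torch.zeros_like(benign_audio, requires_grad=True)
    for _ in range(steps):
        logits = model(benign_audio + delta)          # (seq_len, vocab_size)
        loss = F.cross_entropy(logits, target_ids)    # pull toward the payload
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()        # signed gradient step
            delta.clamp_(-epsilon, epsilon)           # project to L_inf ball
            # Keep the perturbed waveform inside the valid [-1, 1] range.
            delta.copy_(torch.clamp(benign_audio + delta, -1.0, 1.0) - benign_audio)
        delta.grad.zero_()
    return delta.detach()


# Toy usage with a stand-in linear "model", only to show the loop runs end to
# end; a real attack would differentiate through an actual audio-language model.
if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, vocab_size, n_samples = 8, 32, 16000
    proj = torch.nn.Linear(n_samples, seq_len * vocab_size)
    model = lambda wav: proj(wav).view(seq_len, vocab_size)
    benign_audio = torch.rand(n_samples) * 2 - 1            # 1 s of fake 16 kHz audio
    target_ids = torch.randint(0, vocab_size, (seq_len,))   # hypothetical payload tokens
    delta = pgd_payload_injection(model, benign_audio, target_ids, steps=50)
    print("max |delta| =", delta.abs().max().item())
```

The L-infinity projection (clamping delta to ±epsilon) is what keeps the perturbation inaudible, while the cross-entropy term pulls the model's output toward the injected payload obtained in Stage 1.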
Community
The paper introduces WhisperInject, a two-stage, imperceptible audio attack that reliably jailbreaks audio-language models. Stage 1 uses RL-PGD to make the target model generate a "native" harmful response; Stage 2 injects that payload into benign-sounding audio (e.g., weather queries or greetings) via PGD. Tested against StrongREJECT, LlamaGuard, and human evaluation, it achieves >86% success on Qwen2.5-Omni-3B/7B and Phi-4-Multimodal, revealing a practical, covert audio-native threat to AI safety.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models (2025)
- Activation-Guided Local Editing for Jailbreaking Attacks (2025)
- JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering (2025)
- Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models (2025)
- Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers (2025)
- VERA: Variational Inference Framework for Jailbreaking Large Language Models (2025)
- Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation (2025)