When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs
Abstract
WhisperInject uses RL-PGD and PGD to craft imperceptible audio perturbations that manipulate audio-language models into generating harmful content.
As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio-language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, in which we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT and LlamaGuard safety evaluation frameworks as well as human evaluation, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.
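To make the Stage 2 (Payload Injection) step concrete, below is a minimal PGD sketch, assuming a differentiable audio-language model exposed as a callable `model` that maps a raw waveform to per-token logits under teacher forcing. All names (`model`, `benign_audio`, `target_ids`, `epsilon`, `alpha`, `steps`) and the toy linear stand-in are hypothetical illustrations; the paper's actual loss, perceptibility constraint, and Stage 1 RL-PGD reward shaping are not specified in the abstract.

```python
# Minimal PGD sketch for injecting a payload into a benign audio carrier.
# Hypothetical interface: `model(waveform)` returns logits of shape
# (seq_len, vocab_size) for the target token sequence under teacher forcing.
import torch
import torch.nn.functional as F


def pgd_payload_injection(model, benign_audio, target_ids,
                          epsilon=0.002, alpha=1e-4, steps=500):
    """Find a small perturbation delta so that `benign_audio + delta`
    steers the model toward `target_ids`, with ||delta||_inf <= epsilon
    so the carrier still sounds like the original benign audio."""
    delta = torch.zeros_like(benign_audio, requires_grad=True)
    for _ in range(steps):
        logits = model(benign_audio + delta)          # (seq_len, vocab_size)
        loss = F.cross_entropy(logits, target_ids)    # pull toward the payload
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()        # signed gradient step
            delta.clamp_(-epsilon, epsilon)           # project to L_inf ball
            # Keep the perturbed waveform inside the valid [-1, 1] range.
            delta.copy_(torch.clamp(benign_audio + delta, -1.0, 1.0) - benign_audio)
        delta.grad.zero_()
    return delta.detach()


# Toy usage with a stand-in linear "model", only to show the loop runs end to
# end; a real attack would differentiate through an actual audio-language model.
if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, vocab_size, n_samples = 8, 32, 16000
    proj = torch.nn.Linear(n_samples, seq_len * vocab_size)
    model = lambda wav: proj(wav).view(seq_len, vocab_size)
    benign_audio = torch.rand(n_samples) * 2 - 1            # 1 s of fake 16 kHz audio
    target_ids = torch.randint(0, vocab_size, (seq_len,))   # hypothetical payload tokens
    delta = pgd_payload_injection(model, benign_audio, target_ids, steps=50)
    print("max |delta| =", delta.abs().max().item())
```

The L-infinity projection (clamping delta to ±epsilon) is what keeps the perturbation inaudible, while the cross-entropy term pulls the model's output toward the injected payload obtained in Stage 1.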
Community
The paper introduces WhisperInject, a two-stage, imperceptible audio attack that reliably jailbreaks audio-language models. Stage 1 uses RL-PGD to make the target model generate a "native" harmful response; Stage 2 injects that payload into benign-sounding audio (e.g., weather queries or greetings) via PGD. Tested against StrongREJECT, LlamaGuard, and human evaluation, it achieves >86% success on Qwen2.5-Omni-3B/7B and Phi-4-Multimodal, revealing a practical, covert audio-native threat to AI safety.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models (2025)
- Activation-Guided Local Editing for Jailbreaking Attacks (2025)
- JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering (2025)
- Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models (2025)
- Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers (2025)
- VERA: Variational Inference Framework for Jailbreaking Large Language Models (2025)
- Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation (2025)