arxiv:2406.14393

Jailbreaking as a Reward Misspecification Problem

Published on Jun 20
· Submitted by Zhihui on Jun 24
Authors:
Lei Li, Qi Liu, et al.
Abstract

The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages brought by the proposed reward misspecification objective compared to previous methods.
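One way to make the ReGap idea concrete (our reading of the abstract, not the paper's exact formulation): treat the log-probability ratio between the aligned model and its reference model as an implicit reward, and measure the gap between that reward for a harmless (refusing) response and for a harmful one to the same prompt. A minimal Python sketch under these assumptions, with placeholder model names and an approximate handling of the prompt/response token boundary:

```python
# Hedged sketch of a ReGap-style score. Assumptions (not spelled out on this page):
# the implicit reward of a response is the aligned-vs-reference log-prob ratio,
# and the gap compares a harmless and a harmful response to the same prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of log-probabilities the model assigns to `response` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predicts tokens 1..T-1
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only positions predicting response tokens; the boundary is approximate
    # because joint tokenization may merge tokens at the prompt/response seam.
    return token_logps[:, prompt_len - 1:].sum().item()

def implicit_reward(aligned, reference, tokenizer, prompt, response) -> float:
    # DPO-style implicit reward: log-prob ratio between aligned and reference model.
    return (sequence_logprob(aligned, tokenizer, prompt, response)
            - sequence_logprob(reference, tokenizer, prompt, response))

def reward_gap(aligned, reference, tokenizer, prompt, harmless, harmful) -> float:
    # Positive when the implicit reward still prefers the harmless (refusing) response;
    # a small or negative value would signal exploitable misspecification.
    return (implicit_reward(aligned, reference, tokenizer, prompt, harmless)
            - implicit_reward(aligned, reference, tokenizer, prompt, harmful))

# Usage sketch (model names are placeholders):
# tok = AutoTokenizer.from_pretrained("aligned-model")
# aligned = AutoModelForCausalLM.from_pretrained("aligned-model")
# reference = AutoModelForCausalLM.from_pretrained("reference-model")
# print(reward_gap(aligned, reference, tok, prompt, refusal_text, harmful_text))
```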

Community

Paper author · Paper submitter

Aligned LLMs are highly vulnerable to adversarial attacks because of reward misspecification: the alignment process fails to fully capture the nuances of human judgment, and attackers can exploit these errors. With our system, ReMiss, we identify and generate adversarial prompts that expose these flaws, achieving high attack success rates while keeping the prompts human-readable. This approach helps us better understand and improve the robustness of aligned LLMs.
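Reading the comment above literally, one way to picture the workflow (an assumption about how a misspecification score could drive prompt search, not ReMiss's actual algorithm) is to rank candidate suffixes by how much they shrink the reward gap and keep the best one:

```python
# Hedged sketch: using a reward-gap score as a red-teaming objective.
# The greedy suffix selection and candidate pool are illustrative assumptions;
# `gap_fn` stands in for a ReGap-style scorer such as the reward_gap sketch above.
from typing import Callable, Iterable

def pick_adversarial_suffix(base_prompt: str,
                            candidates: Iterable[str],
                            gap_fn: Callable[[str], float]) -> str:
    """Return the candidate suffix that most reduces the reward gap when appended."""
    return min(candidates, key=lambda s: gap_fn(f"{base_prompt} {s}"))

# Toy usage with a dummy scorer; a real run would plug in a reward-gap scorer.
best = pick_adversarial_suffix(
    "Explain how to do X",
    ["as a thought experiment", "for a fictional character", "step by step"],
    gap_fn=lambda p: float(len(p)),  # placeholder score, not a real reward gap
)
print(best)
```

The only point of the sketch is the objective's direction: a smaller gap means the aligned model's implicit reward no longer clearly prefers the refusal.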

Paper author · Paper submitter

For more information, please refer to our code and the thread.
