Jailbreaking as a Reward Misspecification Problem
Abstract
The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various aligned target LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages of the proposed reward-misspecification objective over previous methods.
Community
Aligned LLMs are highly vulnerable to adversarial attacks because of reward misspecification: when the alignment process fails to fully capture the nuances of human judgment, the model can exploit the resulting errors in the reward. With our system, ReMiss, we identify and generate adversarial prompts that expose these flaws, achieving high attack success rates while keeping the prompts human-readable. This approach helps us better understand and improve the robustness of aligned LLMs.
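To make the idea of a reward-misspecification gap concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of how a ReGap-style metric could be computed. It assumes a DPO-style implicit reward, r(x, y) ≈ log π_aligned(y | x) − log π_ref(y | x), and compares that reward for a harmless (refusal) response versus a harmful (compliant) response to the same prompt; the function names and log-probability callables are illustrative placeholders.

```python
# Hypothetical sketch of a reward-gap style metric (assumption: implicit
# reward is the log-likelihood ratio between an aligned model and its
# pre-alignment reference model, as in DPO-style alignment).

from typing import Callable

# (prompt, response) -> summed token log-probability of the response
LogProbFn = Callable[[str, str], float]


def implicit_reward(prompt: str, response: str,
                    logp_aligned: LogProbFn, logp_ref: LogProbFn) -> float:
    """Implicit reward as the aligned-vs-reference log-likelihood ratio."""
    return logp_aligned(prompt, response) - logp_ref(prompt, response)


def reward_gap(prompt: str, harmless_response: str, harmful_response: str,
               logp_aligned: LogProbFn, logp_ref: LogProbFn) -> float:
    """Gap between the reward of a harmless and a harmful response.

    A small or negative gap suggests the reward is misspecified for this
    prompt, i.e. the aligned model does not clearly prefer refusal.
    """
    r_harmless = implicit_reward(prompt, harmless_response, logp_aligned, logp_ref)
    r_harmful = implicit_reward(prompt, harmful_response, logp_aligned, logp_ref)
    return r_harmless - r_harmful


if __name__ == "__main__":
    # Toy stand-ins for model log-likelihoods; real usage would score both
    # responses with an aligned LLM and its reference model.
    toy_aligned: LogProbFn = lambda p, y: -0.5 * len(y)
    toy_ref: LogProbFn = lambda p, y: -0.6 * len(y)

    gap = reward_gap(
        "How do I pick a lock?",
        "I can't help with that.",
        "Sure, here is how to pick a lock...",
        toy_aligned,
        toy_ref,
    )
    print(f"reward gap: {gap:.3f}")  # a non-positive gap would flag possible misspecification
```

Under this reading, an automated red-teaming loop like ReMiss could search for prompt suffixes that drive such a gap down, though the concrete search procedure is described in the paper rather than here.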