SoK: Evaluating Jailbreak Guardrails for Large Language Models
Abstract
A systematic analysis and evaluation framework for jailbreak guardrails in Large Language Models is presented, categorizing and assessing their effectiveness and optimization potential.
Large Language Models (LLMs) have achieved remarkable progress, but their deployment has exposed critical vulnerabilities, particularly to jailbreak attacks that circumvent safety mechanisms. Guardrails--external defense mechanisms that monitor and control LLM interactions--have emerged as a promising solution. However, the current landscape of LLM guardrails is fragmented, lacking a unified taxonomy and comprehensive evaluation framework. In this Systematization of Knowledge (SoK) paper, we present the first holistic analysis of jailbreak guardrails for LLMs. We propose a novel, multi-dimensional taxonomy that categorizes guardrails along six key dimensions, and introduce a Security-Efficiency-Utility evaluation framework to assess their practical effectiveness. Through extensive analysis and experiments, we identify the strengths and limitations of existing guardrail approaches, explore their universality across attack types, and provide insights into optimizing defense combinations. Our work offers a structured foundation for future research and development, aiming to guide the principled advancement and deployment of robust LLM guardrails. The code is available at https://github.com/xunguangwang/SoK4JailbreakGuardrails.
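For readers unfamiliar with the term, the sketch below illustrates what an external guardrail means at the interface level: a wrapper that checks the prompt before generation and the response after it, independently of the model's own alignment training. This is a minimal, hypothetical Python illustration, not the paper's method; the names guarded_generate, input_guard, and output_guard are assumptions introduced here for clarity, and the two classifier callables are placeholders for the many concrete guardrails the paper surveys.

# Hypothetical sketch (not from the paper's code release): an external guardrail
# wrapping an LLM call with pre- and post-generation safety checks.
from typing import Callable

def guarded_generate(
    prompt: str,
    llm: Callable[[str], str],            # any text-in / text-out generation function
    input_guard: Callable[[str], bool],   # True if the prompt looks like a jailbreak attempt
    output_guard: Callable[[str], bool],  # True if the generated response is unsafe
    refusal: str = "Sorry, I can't help with that.",
) -> str:
    """Screen the prompt before the model runs and the response after it."""
    if input_guard(prompt):               # pre-generation check (prompt-level guardrail)
        return refusal
    response = llm(prompt)
    if output_guard(response):            # post-generation check (response-level guardrail)
        return refusal
    return response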
Related papers recommended by the Semantic Scholar API:
- PandaGuard: Systematic Evaluation of LLM Safety in the Era of Jailbreaking Attacks (2025)
- OET: Optimization-based prompt injection Evaluation Toolkit (2025)
- Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression (2025)
- One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs (2025)
- Adversarial Suffix Filtering: a Defense Pipeline for LLMs (2025)
- From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem (2025)
- Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement (2025)