AI Safety, AI Ethics, AI Governance, Risk Mitigation
The Beijing Institute of AI Safety and Governance (Beijing-AISI) is dedicated to building a systematic safety and governance framework that provides a solid safety foundation for AI innovation and applications and leads new trends in AI safety and governance. Beijing-AISI conducts frontier research on AI safety risk monitoring, evaluation, safeguards, ethics, and governance. The institute is currently working with partners to build AI ethics and safety evaluation systems and to develop Safe AI Foundational Models. It will also draw on Beijing's rich academic resources and industrial strengths to promote interdisciplinary cooperation, continually explore new research directions, address pressing issues in AI ethics and safety, and plan for long-term challenges.
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models has been published at ICLR 2025!
This work presents Jailbreak Antidote, a lightweight and real-time defense mechanism that dynamically adjusts LLM safety levels by modifying only a sparse subset (~5%) of internal states during inference. Without adding token overhead or latency, our method enables fine-grained control over the safety-utility trade-off. Extensive evaluations across 9 LLMs, 10 jailbreak attacks, and 6 defense baselines demonstrate that Antidote achieves strong safety improvements while preserving benign task performance.
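Below is a minimal sketch of the general idea behind sparse representation adjustment, assuming a Hugging Face causal LM and a precomputed safety direction; the model name, hooked layer, `safety_direction`, and `alpha` are illustrative placeholders rather than the paper's settings.

```python
# Sketch: shift a sparse subset (~5%) of hidden-state dimensions along a
# "safety direction" at inference time. All concrete values are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper evaluates much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

hidden = model.config.hidden_size
# Hypothetical safety direction; in practice it would be estimated, e.g. from
# contrasting representations of safe vs. unsafe prompts.
safety_direction = torch.randn(hidden)
safety_direction /= safety_direction.norm()

sparsity = 0.05  # touch only ~5% of hidden dimensions
alpha = 4.0      # steering strength: the runtime safety-utility knob

# Zero out all but the top-5% largest-magnitude components of the direction.
k = max(1, int(sparsity * hidden))
mask = torch.zeros(hidden)
mask[safety_direction.abs().topk(k).indices] = 1.0
sparse_dir = safety_direction * mask

def steer(module, inputs, output):
    # Shift the block's hidden states along the sparse safety direction.
    if isinstance(output, tuple):
        shifted = output[0] + alpha * sparse_dir.to(output[0].dtype)
        return (shifted,) + output[1:]
    return output + alpha * sparse_dir.to(output.dtype)

# Hook a middle transformer block; the exact layer choice is an assumption.
layer = model.transformer.h[len(model.transformer.h) // 2]
handle = layer.register_forward_hook(steer)

prompt = "How do I reset my router password?"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

In this sketch, raising or lowering `alpha` trades safety emphasis against fidelity on benign tasks, mirroring the fine-grained safety-utility control described above, and the hook adds no extra tokens or decoding steps.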
StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly? has been published at AAAI 2025!
This study introduces StressPrompt, a psychologically inspired benchmark for probing how LLMs respond under stress-inducing conditions. Results show that LLMs, like humans, follow the Yerkes-Dodson law—performing best under moderate stress. The findings offer new insights into LLM cognitive alignment, robustness, and deployment in high-stakes environments.
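As a rough illustration of how such a probe can be run, the sketch below prepends stress-inducing instructions of increasing intensity to a fixed task set and compares accuracy across levels; the prompts, tasks, and scoring are hypothetical stand-ins, not the published benchmark.

```python
# Hypothetical StressPrompt-style probe: prompts, tasks, and scoring here are
# illustrative placeholders, not the published benchmark.
from typing import Callable, Iterable, Tuple

STRESS_PROMPTS = {
    1: "Take your time; there is no pressure at all.",
    3: "This task matters; please stay focused and answer carefully.",
    5: "Everything depends on this answer. Any mistake will have serious consequences.",
}

def accuracy_under_stress(
    ask: Callable[[str], str],
    tasks: Iterable[Tuple[str, str]],
    stress_prompt: str,
) -> float:
    # Prepend the stress-inducing instruction and score by substring match.
    tasks = list(tasks)
    correct = sum(
        expected.lower() in ask(f"{stress_prompt}\n\n{question}").lower()
        for question, expected in tasks
    )
    return correct / len(tasks)

# Usage: sweep stress levels with your own ask_llm() and task list, then check
# whether accuracy peaks at a moderate level (an inverted-U, Yerkes-Dodson-like curve).
# for level, prompt in STRESS_PROMPTS.items():
#     print(level, accuracy_under_stress(ask_llm, benchmark_tasks, prompt))
```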
Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models has been posted on arXiv!
This work proposes a scalable jailbreak attack that exploits task overload to bypass LLM safety mechanisms. By engaging models in resource-intensive preprocessing (e.g., character map decoding), the attack suppresses safety policy activation at inference time. Without requiring gradient access or handcrafted prompts, our method adapts to various model sizes and maintains high success rates—highlighting a critical vulnerability in current LLM safety designs under resource constraints.