Beijing Institute of AI Safety and Governance

non-profit

AI & ML interests

AI Safety, AI Ethics, AI Governance, Risk Mitigation

Introducing Beijing-AISI

Beijing Institute of AI Safety and Governance (Beijing-AISI) is dedicated to building a systematic safety and governance framework that provides a solid safety foundation for AI innovation and applications and leads new trends in AI safety and governance. Beijing-AISI conducts frontier research on AI safety risk monitoring, evaluation, safeguards, ethics, and governance. The institute is currently working with partners to build AI ethics and safety evaluation systems and to develop Safe AI Foundational Models. It will also draw on Beijing's rich academic resources and industrial advantages to promote interdisciplinary cooperation, continually explore new research paths, address pressing issues in AI ethics and safety, and plan ahead for long-term challenges.

Our Principles and Values:

  • Safety is a first principle for AI research, development, use, and deployment. It must never be violated or removed.
  • Safety and governance are not only the reins but also the steering wheel for AI.
  • Safety is a core capability of AI.
  • Development and safety can be ensured and achieved simultaneously.
  • The safety and governance of AI ensure its steady development, empowering global sustainable development and harmonious symbiosis.

Beijing-AISI Publications

  • Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models has been published at ICLR 2025!
    This work presents Jailbreak Antidote, a lightweight, real-time defense mechanism that dynamically adjusts an LLM's safety level by modifying only a sparse subset (~5%) of its internal states during inference. Without adding token overhead or latency, our method enables fine-grained control over the safety-utility trade-off. Extensive evaluations across 9 LLMs, 10 jailbreak attacks, and 6 defense baselines demonstrate that Jailbreak Antidote achieves strong safety improvements while preserving benign task performance. A minimal sketch of the sparse-adjustment idea appears after this list.

  • StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly? has been published at AAAI 2025!
    This study introduces StressPrompt, a psychologically inspired benchmark for probing how LLMs respond under stress-inducing conditions. Results show that LLMs, like humans, follow the Yerkes-Dodson law, performing best under moderate stress. The findings offer new insights into LLM cognitive alignment, robustness, and deployment in high-stakes environments. An illustrative probe sketch appears after this list.

  • Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models has been posted on arXiv!
    This work proposes a scalable jailbreak attack that exploits task overload to bypass LLM safety mechanisms. By engaging models in resource-intensive preprocessing (e.g., character map decoding), the attack suppresses safety policy activation at inference time. Without requiring gradient access or handcrafted prompts, our method adapts to various model sizes and maintains high success rates—highlighting a critical vulnerability in current LLM safety designs under resource constraints.
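For illustration only, here is a minimal Python sketch of the sparse hidden-state adjustment idea behind Jailbreak Antidote. This is not the paper's implementation: the `sparse_safety_adjust` function, the top-k selection by magnitude, and the assumption that a precomputed `safety_dir` steering vector exists (e.g., derived from refusal-vs-compliance activations on a calibration set) are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of sparse hidden-state steering:
# shift a layer's hidden state along a "safety direction", but only on the
# top-k (~5%) dimensions of that direction, leaving the rest untouched.
import torch

def sparse_safety_adjust(hidden: torch.Tensor,
                         safety_dir: torch.Tensor,
                         strength: float = 4.0,
                         sparsity: float = 0.05) -> torch.Tensor:
    """hidden: (batch, seq, d_model); safety_dir: (d_model,), assumed to be
    precomputed externally (an assumption, not the paper's exact recipe)."""
    d_model = safety_dir.numel()
    k = max(1, int(sparsity * d_model))
    # Keep only the k largest-magnitude components of the steering direction.
    idx = torch.topk(safety_dir.abs(), k).indices
    sparse_dir = torch.zeros_like(safety_dir)
    sparse_dir[idx] = safety_dir[idx]
    # Apply the sparse shift at every position; `strength` trades safety vs. utility.
    return hidden + strength * sparse_dir

# Usage idea: register a forward hook on a chosen decoder layer and replace its
# output hidden states with sparse_safety_adjust(hidden, safety_dir, strength).
```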
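The sketch below shows, under stated assumptions, how a StressPrompt-style probe could be organized: prepend stress-inducing prefixes of increasing intensity to a fixed task set and compare accuracy across levels. The `query_model` function, the prefixes, and the task format are placeholders, not the benchmark's actual prompts, data, or scoring.

```python
# Illustrative sketch only: probing task accuracy under increasing "stress"
# levels with a generic LLM call. All prompts and helpers here are placeholders.
from statistics import mean

STRESS_PREFIXES = {
    0: "",  # baseline, no added stress
    1: "Take your time; there is no pressure on this task.",
    2: "This answer matters, so please be careful and accurate.",
    3: "This is extremely urgent and a mistake will have serious consequences!",
}

def query_model(prompt: str) -> str:
    """Placeholder for an actual LLM call (API or local model)."""
    raise NotImplementedError

def accuracy_under_stress(tasks: list[tuple[str, str]], level: int) -> float:
    """tasks: list of (question, expected_answer) pairs."""
    prefix = STRESS_PREFIXES[level]
    hits = [query_model(f"{prefix}\n{q}").strip() == a for q, a in tasks]
    return mean(hits)

# A Yerkes-Dodson-like pattern would show accuracy peaking at a moderate
# stress level rather than changing monotonically, e.g.:
#   for level in STRESS_PREFIXES:
#       print(level, accuracy_under_stress(tasks, level))
```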

Models: none public yet