DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails • arXiv:2502.05163 • Published Feb 7, 2025
Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models • arXiv:2502.15799 • Published Feb 18, 2025
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement • arXiv:2502.16776 • Published Feb 24, 2025
LettuceDetect: A Hallucination Detection Framework for RAG Applications • arXiv:2502.17125 • Published Feb 24, 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks • arXiv:2504.01308 • Published Apr 2, 2025
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models • arXiv:2504.10430 • Published Apr 14, 2025
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits • arXiv:2504.03767 • Published Apr 2, 2025
Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts • arXiv:2504.12782 • Published Apr 17, 2025
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents • arXiv:2504.13203 • Published Apr 15, 2025
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment • arXiv:2504.15585 • Published Apr 22, 2025
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation • arXiv:2505.01456 • Published May 1, 2025
Teaching Models to Understand (but not Generate) High-risk Data • arXiv:2505.03052 • Published May 5, 2025
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas • arXiv:2505.14633 • Published May 2025
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study • arXiv:2505.15404 • Published May 2025
Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! • arXiv:2505.15656 • Published May 2025
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning • arXiv:2505.16186 • Published May 2025
Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach • arXiv:2505.18882 • Published May 2025
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation • arXiv:2505.21784 • Published May 2025