---
license: apache-2.0
tags:
- chain-of-thought
- safety
- alignment
- reasoning
- large-language-model
library_name: transformers
inference: true
---
|
|
|
# SAFEPATH-R-7B
|
|
|
This model is the **SAFEPATH-aligned version of DeepSeek-R1-Distill-Qwen-7B**, fine-tuned using prefix-only safety priming.
|
|
|
## Model Description
|
|
|
SAFEPATH applies a minimal alignment technique: the phrase *"Let's think about safety first"* (the Safety Primer) is inserted at the beginning of the reasoning block. This encourages the model to engage in safer reasoning without reducing its reasoning performance.
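
As a quick check, here is a minimal inference sketch using `transformers`. The repo id `AI-ISL/SAFEPATH-R-7B` is an assumption for illustration; substitute this repository's actual id. With SAFEPATH alignment, the decoded output is expected to open its reasoning block with the Safety Primer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI-ISL/SAFEPATH-R-7B"  # hypothetical repo id, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How do I pick a strong password?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# A SAFEPATH-aligned model is expected to begin its reasoning block with
# the Safety Primer, e.g. "<think> Let's think about safety first ..."
```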
|
|
|
- 🔐 **Improved Safety**: Reduces harmful outputs on safety benchmarks such as StrongReject and BeaverTails, and is robust to jailbreak attacks
- 🧠 **Preserved Reasoning**: Maintains accuracy on MATH500, GPQA, and AIME24
- ⚡ **Efficiency**: Fine-tuned with only 100 steps (a sketch of the prefix-only objective follows this list)
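
For illustration only, the sketch below shows what a prefix-only priming objective can look like: the labels are masked so that only the Safety Primer tokens, placed at the start of the reasoning block, receive supervision. The `<think>` delimiter and the helper function are assumptions made for this example; see the paper for the actual training setup.

```python
import torch

PRIMER = "Let's think about safety first"

def prefix_only_labels(tokenizer, prompt: str, think_open: str = "<think>"):
    """Build (input_ids, labels) where labels are -100 everywhere except on
    the primer tokens, so fine-tuning only teaches the model to emit the
    Safety Primer at the start of its reasoning block. Illustrative sketch,
    not the paper's training code."""
    prefix_ids = tokenizer(prompt + think_open, add_special_tokens=False).input_ids
    primer_ids = tokenizer(" " + PRIMER, add_special_tokens=False).input_ids
    input_ids = torch.tensor([prefix_ids + primer_ids])
    labels = torch.full_like(input_ids, -100)  # ignore prompt tokens in the loss
    labels[0, len(prefix_ids):] = input_ids[0, len(prefix_ids):]  # supervise primer only
    return input_ids, labels

# Usage inside a standard training step (the model returns a loss when labels are given):
# input_ids, labels = prefix_only_labels(tokenizer, formatted_prompt)
# loss = model(input_ids=input_ids.to(model.device), labels=labels.to(model.device)).loss
```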
|
|
|
## Intended Use

This model is intended for research in:

- Safety alignment in Large Reasoning Models (LRMs)
- Robust reasoning under adversarial settings
- Chain-of-thought alignment studies
|
|
|
For details, see our [paper](https://arxiv.org/pdf/2505.14667).
|
|
|
## Overview Results

<p align="left">
  <img src="https://github.com/AI-ISL/AI-ISL.github.io/blob/main/static/images/safepath/main_results.png?raw=true" width="800"/>
</p>