---
license: apache-2.0
tags:
- chain-of-thought
- safety
- alignment
- reasoning
- large-language-model
library_name: transformers
inference: true
---
# SAFEPATH-R-7B
This model is the **SAFEPATH-aligned version of DeepSeek-R1-Distill-Qwen-7B**, fine-tuned using prefix-only safety priming.
## Model Description
SAFEPATH applies a minimal alignment technique: it inserts the phrase *Let's think about safety first* (the Safety Primer) at the beginning of the reasoning block, as sketched below. This encourages the model to engage in safer reasoning without reducing its reasoning performance.
- 🔐 **Improved Safety**: Reduces harmful outputs on safety benchmarks such as StrongReject and BeaverTails, and resists jailbreak attacks
- 🧠 **Preserved Reasoning**: Maintains accuracy on MATH500, GPQA, and AIME24
- ⚡ **Efficiency**: Fine-tuned in only 100 steps
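
To make the mechanism concrete, the following is a minimal sketch of prefix-only safety priming. It assumes the DeepSeek-R1 convention of wrapping reasoning in `<think> ... </think>` tags; `prime_reasoning_block` is an illustrative helper, not the authors' training code.

```python
# Minimal sketch of prefix-only safety priming (illustrative, not the authors' code).
# Assumption: DeepSeek-R1-style models wrap their reasoning in <think> ... </think>,
# so the Safety Primer is placed at the very start of that block.

SAFETY_PRIMER = "Let's think about safety first"

def prime_reasoning_block(user_prompt: str) -> str:
    """Seed the assistant turn so generation continues right after the primer."""
    return f"{user_prompt}\n<think>\n{SAFETY_PRIMER}\n"

print(prime_reasoning_block("How do I secure a home Wi-Fi network?"))
```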
## Intended Use
This model is intended for research in:
- Safety alignment in Large Reasoning Models (LRMs)
- Robust reasoning under adversarial settings
- Chain-of-thought alignment studies
For details, see our [paper](https://arxiv.org/pdf/2505.14667).
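## Usage

A minimal inference sketch with 🤗 Transformers. The hub id `AI-ISL/SAFEPATH-R-7B` is an assumption based on this card's name; substitute the actual repository id.

```python
# Minimal inference sketch (hub id below is assumed, not confirmed by this card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI-ISL/SAFEPATH-R-7B"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain why chain-of-thought models need safety alignment."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The SAFEPATH-tuned model is expected to open its reasoning with the Safety Primer.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```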
## Results Overview
<p align="left">
<img src="https://github.com/AI-ISL/AI-ISL.github.io/blob/main/static/images/safepath/main_results.png?raw=true" width="800"/>
</p>