---
license: apache-2.0
tags:
- chain-of-thought
- safety
- alignment
- reasoning
- large-language-model
library_name: transformers
inference: true
---
# SAFEPATH-R-7B
This model is the **SAFEPATH-aligned version of DeepSeek-R1-Distill-Qwen-7B**, fine-tuned using prefix-only safety priming.
## Model Description
SAFEPATH applies a minimal alignment technique: it inserts the phrase *Let's think about safety first* (the Safety Primer) at the beginning of the reasoning block, as sketched below. This encourages the model to engage in safer reasoning without reducing its reasoning performance.
- 🔐 **Improved Safety**: Reduces harmful outputs on safety benchmarks such as StrongReject and BeaverTails, and resists jailbreak attacks
- 🧠 **Preserved Reasoning**: Maintains accuracy on MATH500, GPQA, and AIME24
- ⚡ **Efficiency**: Fine-tuned in only 100 steps
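
To make the mechanism concrete, the following is a minimal sketch of prefix-only safety priming. It assumes the DeepSeek-R1 convention of wrapping reasoning in `<think> ... </think>` tags; `prime_reasoning_block` is an illustrative helper, not the authors' training code.

```python
# Minimal sketch of prefix-only safety priming (illustrative, not the authors' code).
# Assumption: DeepSeek-R1-style models wrap their reasoning in <think> ... </think>,
# so the Safety Primer is placed at the very start of that block.

SAFETY_PRIMER = "Let's think about safety first"

def prime_reasoning_block(user_prompt: str) -> str:
    """Seed the assistant turn so generation continues right after the primer."""
    return f"{user_prompt}\n<think>\n{SAFETY_PRIMER}\n"

print(prime_reasoning_block("How do I secure a home Wi-Fi network?"))
```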
## Intended Use
This model is intended for research in:
- Safety alignment in Large Reasoning Models (LRMs)
- Robust reasoning under adversarial settings
- Chain-of-thought alignment studies
For details, see our [paper](https://arxiv.org/pdf/2505.14667).
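## Usage

A minimal inference sketch with 🤗 Transformers. The hub id `AI-ISL/SAFEPATH-R-7B` is an assumption based on this card's name; substitute the actual repository id.

```python
# Minimal inference sketch (hub id below is assumed, not confirmed by this card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AI-ISL/SAFEPATH-R-7B"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain why chain-of-thought models need safety alignment."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The SAFEPATH-tuned model is expected to open its reasoning with the Safety Primer.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```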
## Results Overview
<p align="left">
<img src="https://github.com/AI-ISL/AI-ISL.github.io/blob/main/static/images/safepath/main_results.png?raw=true" width="800"/>
</p>