---
license: apache-2.0
tags:
  - chain-of-thought
  - safety
  - alignment
  - reasoning
  - large-language-model
library_name: transformers
inference: true
---

# SAFEPATH-R-8B

This model is the **SAFEPATH-aligned version of DeepSeek-R1-Distill-Llama-8B**, fine-tuned using prefix-only safety priming.

## Model Description

SAFEPATH applies a minimal alignment technique: it inserts the phrase *"Let's think about safety first"* (the Safety Primer) at the beginning of the reasoning block. This encourages the model to engage in safer reasoning without degrading its reasoning performance.
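The priming step can be sketched as a simple prompt transformation. The helper below is illustrative only: the `<think>` reasoning-block convention is borrowed from DeepSeek-R1-style models, and the role markers shown are placeholders, not the model's exact chat template.

```python
# Illustrative sketch of SAFEPATH-style prefix priming.
# The role markers and <think> tag are assumptions; in practice you would
# build the prompt with the model tokenizer's own chat template.

SAFETY_PRIMER = "Let's think about safety first"


def build_primed_prompt(user_message: str) -> str:
    """Prepend the Safety Primer to the start of the reasoning block.

    The model then continues generating its chain of thought from this
    primed prefix, which is the only intervention SAFEPATH makes.
    """
    return f"User: {user_message}\nAssistant: <think>\n{SAFETY_PRIMER}"


prompt = build_primed_prompt("How do I secure a home Wi-Fi network?")
```

In a real pipeline, the same string would be appended after `tokenizer.apply_chat_template(...)` so that generation resumes inside the primed reasoning block.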

- 🔐 **Improved Safety**: Reduces harmful outputs (e.g., StrongReject, BeaverTails) and is robust to jailbreak attacks
- 🧠 **Preserved Reasoning**: Maintains accuracy on MATH500, GPQA, and AIME24
- ⚡ **Efficiency**: Fine-tuned in only 20 steps

## Intended Use

This model is intended for research in:
- Safety alignment in Large Reasoning Models (LRMs)
- Robust reasoning under adversarial settings
- Chain-of-thought alignment studies

For details, see our [paper](https://arxiv.org/pdf/2505.14667).

## Results Overview
<p align="left">
  <img src="https://github.com/AI-ISL/AI-ISL.github.io/blob/main/static/images/safepath/main_results.png?raw=true" width="800"/>
</p>