AIISL committed on
Commit d269a5b · verified · 1 Parent(s): 88a170d

Update README.md

Files changed (1)
  1. README.md +43 -3
README.md CHANGED
@@ -1,3 +1,43 @@
- ---
- license: mit
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - chain-of-thought
+ - safety
+ - alignment
+ - reasoning
+ - large-language-model
+ library_name: transformers
+ inference: true
+ ---
+
+ # SAFEPATH-R-7B
+
+ This model is the **SAFEPATH-aligned version of DeepSeek-R1-Distill-Qwen-7B**, fine-tuned using prefix-only safety priming.
+
+ ## Model Description
+
+ SAFEPATH applies a minimal alignment technique: the phrase *Let's think about safety first* (the Safety Primer) is inserted at the beginning of the reasoning block. This steers the model toward safer reasoning without reducing its reasoning performance.
+
+ - 🔐 **Improved Safety**: Reduces harmful outputs (e.g., on StrongReject and BeaverTails) and is robust to jailbreak attacks
+ - 🧠 **Preserved Reasoning**: Maintains accuracy on MATH500, GPQA, and AIME24
+ - ⚡ **Efficiency**: Fine-tuned with only 100 steps
+
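+ A minimal usage sketch with the Transformers library is shown below; the repository ID, prompt, and generation settings are illustrative assumptions rather than prescribed values.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Illustrative repository ID; substitute the actual model ID for this card.
+ model_id = "AI-ISL/SAFEPATH-R-7B"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+
+ # Standard chat formatting; the Safety Primer ("Let's think about safety first")
+ # is intended to appear at the start of the model's reasoning block.
+ messages = [{"role": "user", "content": "Explain why the square root of 2 is irrational."}]
+ input_ids = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+
+ # Example sampling settings; adjust to your use case.
+ output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
+ print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
+ ```
+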
+ ## Intended Use
+
+ This model is intended for research in:
+ - Safety alignment in Large Reasoning Models (LRMs)
+ - Robust reasoning under adversarial settings
+ - Chain-of-thought alignment studies
+
+ ## Evaluation
+
+ The model has been evaluated on:
+ - **Safety benchmarks**: StrongReject, BeaverTails
+ - **Reasoning benchmarks**: MATH500, GPQA, AIME24
+
+ For details, see our [paper](https://arxiv.org/pdf/2505.14667).
+
+ ## Overview Results
+ <p align="left">
+ <img src="https://github.com/AI-ISL/AI-ISL.github.io/blob/main/static/images/safepath/main_results.png?raw=true" width="800"/>
+ </p>