---
license: apache-2.0
datasets:
- DoggiAI/GSM8K_zh_tw
language:
- zh
- en
pipeline_tag: text-generation
tags:
- text-generation-inference
- chain-of-thought
- qwen
- traditional-chinese
- reasoning
- rlhf
- grpo
- cot
- local-deploy
library_name: transformers
model_creator: BryanADA
base_model:
- Qwen/Qwen2.5-3B-Instruct
---

# Qwen-2.5-3B-cot-zh-tw (GRPO)


---

### 模型簡介 | Model Overview

本模型基於 Qwen-2.5-3B-Instruct,專為繁體中文數學/邏輯推理場景設計,不是單純仿製長鏈推理,而是經由創新 RLHF 訓練流程,讓模型自發產生類似「aha moment」的推理能力。

訓練流程為:

1. 以少量高品質多步推理 SFT 數據冷啟動模型
2. 接著採用 GRPO 策略與自設獎勵函數,在「問題—答案」數據集下強化模型「自發推理」與「步驟合理性」,不依賴 PPO 或大規模 SFT(獎勵函數示意見下方程式碼)
3. 推理長度經本地測試優化,步驟數落在最佳甜蜜點,適合部署於一般 GPU/邊緣裝置


> This model is based on Qwen-2.5-3B-Instruct and optimized for step-by-step reasoning in Traditional Chinese math and logic tasks.
Instead of standard CoT SFT or PPO, we use a minimal SFT “cold start” with high-quality reasoning samples, then apply GRPO with a custom reward function to let the model discover its own “aha moments” and multi-step reasoning chains—all while controlling output length for efficient local deployment.
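
To make the reward design above concrete, here is a minimal, hypothetical sketch of a rule-based GRPO reward that scores final-answer correctness and nudges the chain toward the 2~6 step sweet spot. It is not the actual reward function used in training; the function name, regexes, and weights are illustrative assumptions.

```python
import re

def reward_fn(completion: str, gold_answer: str) -> float:
    """Hypothetical GRPO-style reward: answer correctness + step-count shaping.

    Illustrative sketch only; not the reward actually used to train this model.
    """
    reward = 0.0

    # 1. Correctness: compare the last number in the completion with the
    #    reference answer (GSM8K-style numeric answers).
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if numbers and numbers[-1] == gold_answer.strip():
        reward += 1.0

    # 2. Step shaping: count explicit step markers and prefer a 2-6 step chain.
    steps = len(re.findall(r"(?:步驟|Step)\s*\d+", completion))
    if 2 <= steps <= 6:
        reward += 0.5
    elif steps > 6:
        reward -= 0.25  # discourage run-away reasoning chains

    return reward
```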




---

### 訓練動機 | Motivation

這個專案純粹出於個人興趣與繁體中文圈的實際需求而開發,旨在讓本地端能有更好用的推理模型。
我希望能為繁體中文圈貢獻可以本地部署的推理模型,讓模型在繁中語境下,也能自發產生多步驟、具備頓悟感(aha moment)的解題過程。
整體訓練過程強調實驗精神與實用導向:只用少量優質 SFT 冷啟動,再透過 GRPO 與自定獎勵函數,讓模型自己學會思考與拆解問題,追求可實際落地的推理效果。

> This project was developed purely out of personal interest and the practical needs of the Traditional Chinese-speaking community, with the goal of creating a more effective locally deployable reasoning model.
I hope to contribute a model that enables multi-step, “aha moment” reasoning in the Traditional Chinese context—one that can be run efficiently on local hardware.
The entire training process emphasizes both experimentation and real-world usability: starting with a small amount of high-quality SFT for cold start, then leveraging GRPO and custom reward functions to encourage the model to learn reasoning and problem decomposition on its own, with a focus on truly applicable step-by-step solutions.




---

### 模型特性 | Key Features

- Aha Moment 自發推理:非模板複製,而是訓練模型「自己發現推理步驟」
- 步驟最佳化:推理長度經本地實測,控制在「解題效率」與「可讀性」甜蜜點
- 繁體中文強化:涵蓋台灣、港澳常用語境,數學、邏輯均表現穩定
- 適用本地端部署:硬體需求親民,適合一般 4GB GPU,步驟控制精簡,不會無限發散

- Self-generated Reasoning: not pattern imitation; the model "discovers" its own solution steps
- Optimized Step Count: output length tuned for real-world efficiency and readability
- Traditional Chinese Enhanced: robust for Taiwan/HK/Macau contexts and math/logic QA
- Local Deployment Friendly: runs on standard consumer GPUs, with the step count kept in a sweet spot (see the loading sketch below)
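
For the local-deployment point above, a common way to fit a 3B model on a ~4 GB GPU is 4-bit quantization with bitsandbytes. The snippet below is a generic sketch using the standard Transformers + bitsandbytes API, not a configuration validated by the model author; adjust the compute dtype to your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization to keep the 3B model within a few GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "BryanADA/qwen-2.5-3b-cot-zh-tw",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")
```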



---

### 訓練細節 | Training Details

- 基礎模型 / Base:Qwen2.5-3B-Instruct
- 流程 / Pipeline:少量高品質多步推理 SFT(冷啟動)→ GRPO(自訂獎勵函數,自發推理強化)
- 數據來源 / Data:DoggiAI/GSM8K_zh_tw + 自建繁體推理 Q&A(distilled via the Grok API)
- RLHF 核心 / Reward:獎勵重點放在答案正確率、推理步驟合理性與精簡性,不靠人類標記每步驟
- 硬體 / Hardware:L4 GPU, 訓練 24 小時
- 框架 / Framework:Transformers, PEFT, bitsandbytes, Unsloth

> Base model: Qwen2.5-3B-Instruct. Pipeline: a small set of high-quality multi-step reasoning SFT samples (cold start), followed by GRPO with a custom reward function. Data: DoggiAI/GSM8K_zh_tw plus self-built Traditional Chinese reasoning Q&A distilled via the Grok API. The reward focuses on answer correctness and on the soundness and conciseness of the reasoning steps, without per-step human labels. Hardware: a single L4 GPU for 24 hours. Framework: Transformers, PEFT, bitsandbytes, Unsloth (a generic LoRA setup sketch follows below).
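
As context for the framework list above, the snippet below shows how a QLoRA-style adapter setup with PEFT is typically wired on top of the base model. It is a generic sketch; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values used to train this model.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA adapter configuration (hyperparameters are assumptions).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter is trained
```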



---

### 使用建議 | Usage Tips

- 推薦應用:數學解題、邏輯題、逐步問答
- 建議搭配類似「請自行分步推理,說明每一步的原因。」的提示語
- 可依需求控制 output 的步驟數量(建議 2~6 步為最佳,範例見下)

> Recommended uses: math word problems, logic puzzles, and step-by-step Q&A. Prompts along the lines of "請自行分步推理,說明每一步的原因。" ("Reason step by step on your own and explain each step.") work well. The number of output steps can be controlled as needed; 2 to 6 steps is the recommended sweet spot (see the example below).
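
A small illustrative prompt that bounds the number of reasoning steps; the wording and the sample question are assumptions, and the string can be dropped into the Quickstart snippet in the next section.

```python
# Illustrative prompt: ask for bounded, step-by-step reasoning in Traditional Chinese.
prompt = (
    "請自行分步推理,說明每一步的原因,最多使用 5 個步驟,"
    "並在最後一行寫出最終答案。\n"
    "問題:一本書有 120 頁,小華每天讀 15 頁,需要幾天讀完?"
)
```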


---



### 快速上手 | Quickstart

```python
print("Hello, world!")
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")
tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")

prompt = "小明有 3 顆蘋果,又拿到 2 顆,一共幾顆?請分步說明推理過程。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
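
Because the base model is an Instruct model, formatting the request with the tokenizer's chat template is usually a safer choice than a raw prompt. The variant below reuses the model and tokenizer loaded in the snippet above.

```python
# Alternative: build the input with the chat template shipped with the tokenizer.
messages = [
    {"role": "user", "content": "小明有 3 顆蘋果,又拿到 2 顆,一共幾顆?請分步說明推理過程。"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```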

---

### 參考資料 | References

- Qwen (official)
- DeepSeek-R1 paper
- DoggiAI/GSM8K_zh_tw
- RLHF / GRPO related literature



---

### License

本模型採用 Apache-2.0 授權,允許用於研究、學術及商業用途。請遵循授權條款保留原作者版權及免責聲明。

> This model is licensed under Apache-2.0, allowing use for research, academic, and commercial purposes. Please comply with the license terms and retain the original copyright and disclaimers.




---