---
license: apache-2.0
datasets:
- DoggiAI/GSM8K_zh_tw
language:
- zh
- en
pipeline_tag: text-generation
tags:
- text-generation-inference
- chain-of-thought
- qwen
- traditional-chinese
- reasoning
- rlhf
- grpo
- cot
- local-deploy
library_name: transformers
model_creator: BryanADA
base_model:
- Qwen/Qwen2.5-3B-Instruct
---
# Qwen-2.5-3B-cot-zh-tw (GRPO)
---
### 模型簡介 | Model Overview
本模型基於 Qwen-2.5-3B-Instruct,專為繁體中文數學/邏輯推理場景設計,不是單純仿製長鏈推理,而是經由創新 RLHF 訓練流程,讓模型自發產生類似「aha moment」的推理能力。
訓練流程為:
1. 以少量高品質多步推理 SFT 數據冷啟動模型
2. 接著採用 GRPO 策略與自設獎勵函數,在「問題—答案」數據集下強化模型「自發推理」與「步驟合理性」,不依賴 PPO 或大規模 SFT
3. 推理長度經本地測試優化,步驟數落在最佳甜蜜點,適合部署於一般 GPU/邊緣裝置
> This model is based on Qwen-2.5-3B-Instruct and optimized for step-by-step reasoning in Traditional Chinese math and logic tasks.
Instead of standard CoT SFT or PPO, we use a minimal SFT “cold start” with high-quality reasoning samples, then apply GRPO with a custom reward function to let the model discover its own “aha moments” and multi-step reasoning chains—all while controlling output length for efficient local deployment.
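
The custom reward itself is not published with this card. Purely to make step 2 concrete, the sketch below scores a completion on final-answer correctness and on keeping the reasoning in a compact number of steps; the regexes, weights, and the 2~6-step range (taken from the usage tips below) are illustrative assumptions, and the function would need to be adapted to whatever GRPO trainer interface (for example trl's `GRPOTrainer`) is actually used.

```python
import re

# Illustrative only: the actual reward used to train this model is not published,
# so the heuristics and weights below are assumptions for demonstration.
def reward_fn(completion: str, reference_answer: str) -> float:
    """Score one completion on answer correctness and step-count compactness."""
    reward = 0.0

    # 1. Answer correctness: compare the last number in the completion
    #    with the reference answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if numbers and numbers[-1] == reference_answer.strip():
        reward += 1.0

    # 2. Step structure: count lines that look like enumerated reasoning steps
    #    and favour the 2~6-step range recommended in the usage tips below.
    steps = len(re.findall(r"^\s*(?:步驟|Step|\d+[.、])", completion, re.MULTILINE))
    if 2 <= steps <= 6:
        reward += 0.5
    elif steps > 6:
        reward -= 0.25  # discourage rambling chains

    return reward
```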
---
### 訓練動機 | Motivation
這個專案純粹出於個人興趣與繁體中文圈的實際需求而開發,旨在讓本地端能有更好用的推理模型。
我希望能為繁體中文圈貢獻可以本地部署的推理模型,讓模型在繁中語境下,也能自發產生多步驟、具備頓悟感(aha moment)的解題過程。
整體訓練過程強調實驗精神與實用導向:只用少量優質 SFT 冷啟動,再透過 GRPO 與自定獎勵函數,讓模型自己學會思考與拆解問題,追求可實際落地的推理效果。
> This project was developed purely out of personal interest and the practical needs of the Traditional Chinese-speaking community, with the goal of creating a more effective locally deployable reasoning model.
I hope to contribute a model that enables multi-step, “aha moment” reasoning in the Traditional Chinese context—one that can be run efficiently on local hardware.
The entire training process emphasizes both experimentation and real-world usability: starting with a small amount of high-quality SFT for cold start, then leveraging GRPO and custom reward functions to encourage the model to learn reasoning and problem decomposition on its own, with a focus on truly applicable step-by-step solutions.
---
### 模型特性 | Key Features
- Aha Moment 自發推理:非模板複製,而是訓練模型「自己發現推理步驟」
- 步驟最佳化:推理長度經本地實測,控制在「解題效率」與「可讀性」的甜蜜點
- 繁體中文強化:涵蓋台灣、港澳常用語境,數學、邏輯均表現穩定
- 適用本地端部署:硬體需求親民,適合一般 4GB GPU,步驟控制精簡,不會無限發散(參考下方示例)
- Self-generated Reasoning: Not pattern imitation; the model "discovers" its own solution steps
- Optimized Step Count: Output length tuned for real-world efficiency and readability
- Traditional Chinese Enhanced: Robust for Taiwan/HK/Macau contexts and math/logic QA
- Local Deployment Friendly: Runs on standard consumer GPUs, with the step count kept in a sweet spot (see the sketch below)
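
The 4GB-GPU claim above generally implies a quantized load. A minimal sketch using a 4-bit bitsandbytes configuration; the values are typical NF4 defaults rather than settings verified for this particular checkpoint, and the `bitsandbytes` and `accelerate` packages are assumed to be installed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Typical NF4 4-bit settings; common defaults, not model-specific tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "BryanADA/qwen-2.5-3b-cot-zh-tw",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")
```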
---
### 訓練細節 | Training Details
- 基礎模型 / Base:Qwen2.5-3B-Instruct
- 流程 / Pipeline:少量高品質多步推理 SFT(冷啟動)→ GRPO(自訂獎勵函數,自發推理強化)/ a small amount of high-quality multi-step SFT (cold start) → GRPO with a custom reward function
- 數據來源 / Data:DoggiAI/GSM8K_zh_tw + 自建繁體推理 Q&A / self-built Traditional Chinese reasoning Q&A (distilled via the Grok API)
- RLHF 核心 / RLHF Core:獎勵重點放在答案正確率、推理步驟合理性與精簡性,不依賴人工逐步驟標註 / rewards focus on answer correctness plus reasonable, concise steps, without per-step human labels(參考下方示意 / see the sketch below)
- 硬體 / Hardware:L4 GPU,訓練 24 小時 / 24 hours of training on an L4 GPU
- 框架 / Framework:Transformers, PEFT, bitsandbytes, Unsloth
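
For readers who want to set up a similar pipeline, a heavily simplified sketch of the GRPO stage using trl's `GRPOTrainer` is shown below. This is not the author's actual training script: the dataset column names ("question", "answer"), the prompt mapping, and the toy reward are all assumptions.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical wiring, not the script used to train this checkpoint.
# The "question"/"answer" column names are assumptions about the dataset schema.
dataset = load_dataset("DoggiAI/GSM8K_zh_tw", split="train")
dataset = dataset.map(lambda row: {"prompt": row["question"]})

def correctness_reward(completions, answer, **kwargs):
    # Toy reward: 1.0 if the reference answer text appears in the completion.
    return [1.0 if a.strip() in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=correctness_reward,
    args=GRPOConfig(output_dir="qwen2.5-3b-cot-zh-tw-grpo", max_completion_length=512),
    train_dataset=dataset,
)
trainer.train()
```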
---
### 使用建議 | Usage Tips
- 推薦應用:數學解題、邏輯題、逐步問答 / Recommended uses: math word problems, logic puzzles, step-by-step Q&A
- 建議使用類似「請自行分步推理,說明每一步的原因。」的提示語 / Prompts along the lines of "Reason step by step and explain the reason for each step" work well
- 可依需求控制輸出的步驟數量(建議 2~6 步為最佳),參考下方示例 / The number of reasoning steps can be steered through the prompt (2~6 steps recommended); see the sketch below
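
A small illustrative helper for the last tip above; the template wording is only a suggestion, not an official prompt format for this model.

```python
# Illustrative prompt builder for steering the number of reasoning steps.
# The wording is a suggestion, not an official prompt format for this model.
def build_prompt(question: str, max_steps: int = 4) -> str:
    return (
        f"{question}\n"
        f"請自行分步推理,最多 {max_steps} 個步驟,說明每一步的原因,"
        f"並在最後一行以「答案:」開頭給出最終答案。"
    )

print(build_prompt("小華有 12 顆糖,平分給 3 位朋友,每人拿到幾顆?"))
```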
---
### 快速上手 | Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model = AutoModelForCausalLM.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")
tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")

# Math word problem asking for step-by-step reasoning
prompt = "小明有 3 顆蘋果,又拿到 2 顆,一共幾顆?請分步說明推理過程。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
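
Because the base model is an instruction-tuned chat model, wrapping the question with the tokenizer's chat template is usually more reliable than a raw string prompt. A minimal sketch, assuming this checkpoint keeps Qwen2.5's built-in chat template:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")
tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")

# Build the input with the tokenizer's chat template.
messages = [
    {"role": "user", "content": "小明有 3 顆蘋果,又拿到 2 顆,一共幾顆?請分步說明推理過程。"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```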
---
### 參考資料 | References
- Qwen official resources
- DeepSeek-R1 paper
- DoggiAI/GSM8K_zh_tw
- Related RLHF / GRPO literature
---
### License
本模型採用 Apache-2.0 授權,允許用於研究、學術及商業用途。請遵循授權條款保留原作者版權及免責聲明。
> This model is licensed under Apache-2.0, allowing use for research, academic, and commercial purposes. Please comply with the license terms and retain the original copyright and disclaimers.
---