Qwen-2.5-3B-cot-zh-tw (GRPO)


模型簡介 | Model Overview

本模型基於 Qwen2.5-3B-Instruct,專為繁體中文數學/邏輯推理場景設計,不是單純仿製長鏈推理,而是經由以 GRPO 為核心的強化學習訓練流程,讓模型自發產生類似「aha moment」的推理能力。

訓練流程為:

  1. 以少量高品質多步推理 SFT 數據冷啟動模型

  2. 接著採用 GRPO 策略與自設獎勵函數,在「問題—答案」數據集下強化模型「自發推理」與「步驟合理性」,不依賴 PPO 或大規模 SFT

  3. 推理長度經本地測試優化,步驟數落在最佳甜蜜點,適合部署於一般 GPU/邊緣裝置

This model is based on Qwen2.5-3B-Instruct and optimized for step-by-step reasoning in Traditional Chinese math and logic tasks. Instead of standard CoT SFT or PPO, we use a minimal SFT “cold start” on high-quality reasoning samples, then apply GRPO with a custom reward function so the model discovers its own “aha moments” and multi-step reasoning chains, all while keeping output length short enough for efficient local deployment.
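
For background, GRPO (as used in the DeepSeek-R1 paper listed in the References) samples a group of G completions per prompt, scores each one with the reward function, and uses the group-normalized reward as the advantage, so no separate value model is needed. The formula below is the standard formulation, not a statement about this model's exact hyperparameters:

$$
A_i = \frac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}
$$

The policy is then updated with a PPO-style clipped objective on these advantages, plus a KL penalty that keeps it close to the reference (cold-start) model.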


訓練動機 | Motivation

這個專案純粹出於個人興趣與繁體中文圈的實際需求而開發,旨在讓本地端能有更好用的推理模型。 我希望能為繁體中文圈貢獻可以本地部署的推理模型,讓模型在繁中語境下,也能自發產生多步驟、具備頓悟感(aha moment)的解題過程。 整體訓練過程強調實驗精神與實用導向:只用少量優質 SFT 冷啟動,再透過 GRPO 與自定獎勵函數,讓模型自己學會思考與拆解問題,追求可實際落地的推理效果。

This project was developed purely out of personal interest and the practical needs of the Traditional Chinese-speaking community, with the goal of creating a more effective locally deployable reasoning model. I hope to contribute a model that enables multi-step, “aha moment” reasoning in the Traditional Chinese context—one that can be run efficiently on local hardware. The entire training process emphasizes both experimentation and real-world usability: starting with a small amount of high-quality SFT for cold start, then leveraging GRPO and custom reward functions to encourage the model to learn reasoning and problem decomposition on its own, with a focus on truly applicable step-by-step solutions.


模型特性 | Key Features

Aha Moment 自發推理:非模板複製,而是訓練模型「自己發現推理步驟」

步驟最佳化:推理長度經本地實測,控制在「解題效率」與「可讀性」甜蜜點

繁體中文強化:涵蓋台灣、港澳常用語境,數學、邏輯均表現穩定

適用本地端部署:硬體需求親民,適合一般 4GB GPU,步驟控制精簡,不會無限發散

Self-generated Reasoning: Not pattern imitation—model “discovers” solution steps

Optimized Step Count: Output length tuned for real-world efficiency and readability

Traditional Chinese Enhanced: Robust for Taiwan/HK/Macau context and math/logic QA

Local Deployment Friendly: Runs on standard consumer GPUs, with step count sweet spot


訓練細節 | Training Details

基礎模型 / Base:Qwen2.5-3B-Instruct

流程 / Pipeline:少量高品質多步推理 SFT(冷啟動)→ GRPO(自訂獎勵函數,自發推理強化)

數據來源 / Data:DoggiAI/GSM8K_zh_tw + 自建繁體推理 Q&A(以 Grok API 蒸餾生成)

獎勵設計 / Reward design:獎勵重點放在答案正確率、推理步驟合理性與精簡性,不依賴人類逐步標記(獎勵函數示意程式碼見本節末)

硬體 / Hardware:L4 GPU, 訓練 24 小時

框架 / Framework:Transformers, PEFT, bitsandbytes, Unsloth
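
The reward design above can be sketched concretely. The code below is a minimal, illustrative example only: it assumes the trl library's GRPOTrainer interface with a peft LoRA adapter, and the dataset column names ("question"/"answer"), hyperparameter values, and reward heuristics are assumptions for illustration, not the exact training script behind this model.

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Question-answer data; the "question"/"answer" column names are assumptions.
dataset = load_dataset("DoggiAI/GSM8K_zh_tw", split="train")
dataset = dataset.map(lambda row: {"prompt": row["question"]})  # GRPOTrainer reads the "prompt" column

def correctness_reward(completions, answer, **kwargs):
    # +1 when the reference answer string appears in the final line of the completion.
    rewards = []
    for completion, ref in zip(completions, answer):
        last_line = completion.strip().splitlines()[-1] if completion.strip() else ""
        rewards.append(1.0 if str(ref) in last_line else 0.0)
    return rewards

def conciseness_reward(completions, **kwargs):
    # Mild bonus for solutions written in roughly 2-6 non-empty lines (steps).
    rewards = []
    for completion in completions:
        steps = len([line for line in completion.splitlines() if line.strip()])
        rewards.append(0.5 if 2 <= steps <= 6 else 0.0)
    return rewards

training_args = GRPOConfig(
    output_dir="qwen2.5-3b-cot-zh-tw-grpo",
    num_generations=8,           # group size: completions sampled per prompt
    max_completion_length=512,   # keeps reasoning chains short for local deployment
    learning_rate=5e-6,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",   # in practice, the SFT cold-start checkpoint
    reward_funcs=[correctness_reward, conciseness_reward],
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()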


使用建議 | Usage Tips

推薦應用 / Recommended uses:數學解題、邏輯題、逐步問答

提示語 / Prompting:適合使用類似「請自行分步推理,說明每一步的原因。」的提示語

步驟控制 / Step control:可依需求控制輸出步驟數量(建議 2~6 步為最佳);提示範例見下方程式碼
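
The prompting advice above can be combined with the model's chat template. The snippet below is a sketch: it assumes the tokenizer retains the standard Qwen2.5 chat template, and the instruction wording and example question are illustrative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")

# Ask for explicit step-by-step reasoning and bound the number of steps (2-6 tends to work well).
messages = [
    {"role": "user",
     "content": "請自行分步推理,說明每一步的原因,並將推理控制在 2~6 步內:火車以每小時 80 公里行駛 2.5 小時,共行駛多少公里?"},
]
prompt_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt_text)  # pass this string to the tokenizer and model.generate, as in the Quickstart below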


快速上手 | Quickstart

print("Hello, world!")
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")
tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")

prompt = "小明有 3 顆蘋果,又拿到 2 顆,一共幾顆?請分步說明推理過程。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
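
For small GPUs (such as the ~4 GB cards mentioned above), loading the model in 4-bit with bitsandbytes is one option. This variant is a sketch and assumes bitsandbytes and accelerate are installed; generation settings mirror the Quickstart above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization to fit the 3B model into a few GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "BryanADA/qwen-2.5-3b-cot-zh-tw",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("BryanADA/qwen-2.5-3b-cot-zh-tw")

prompt = "小明有 3 顆蘋果,又拿到 2 顆,一共幾顆?請分步說明推理過程。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))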

參考資料 | References

Qwen official documentation

DeepSeek-R1 paper

DoggiAI/GSM8K_zh_tw

RLHF / GRPO related literature


License

本模型採用 Apache-2.0 授權,允許用於研究、學術及商業用途。請遵循授權條款保留原作者版權及免責聲明。

This model is licensed under Apache-2.0, allowing use for research, academic, and commercial purposes. Please comply with the license terms and retain the original copyright and disclaimers.

