Model Card for Qwen2.5-Taiwan-3B-Reason-GRPO

Model Details

This model is fine-tuned from benchang1110/Qwen2.5-Taiwan-3B-Reason-SFT (based on Alibaba's Qwen). It was trained with GRPO on the benchang1110/Big-Math-RL-Verified-zhtw dataset, using only problems with a solve rate greater than 0.7.
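The solve-rate filter can be sketched as follows. This is a minimal illustration rather than the author's actual preprocessing: the `solve_rate` field name and the toy records are assumptions, and the real dataset may name its column differently.

```python
# Minimal sketch of the solve-rate filter. The "solve_rate" key and the
# records below are hypothetical; check the dataset's actual column names.
SOLVE_RATE_THRESHOLD = 0.7


def keep_by_solve_rate(records, threshold=SOLVE_RATE_THRESHOLD):
    """Return only the records whose solve rate is strictly greater than threshold."""
    return [r for r in records if r["solve_rate"] > threshold]


toy_records = [
    {"problem": "1 + 1 = ?", "answer": "2", "solve_rate": 0.95},
    {"problem": "A hard geometry problem", "answer": "7", "solve_rate": 0.10},
    {"problem": "A borderline problem", "answer": "3", "solve_rate": 0.70},
]

filtered = keep_by_solve_rate(toy_records)
print([r["problem"] for r in filtered])
```

The borderline record is dropped because the comparison is strict ("greater than 0.7"). With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter`.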

Model Description

Developed by: benchang1110
Funded by: self-funded
Model type: Qwen2ForCausalLM
Language(s) (NLP): Traditional Chinese (zh-tw), English
License: qwen-research
Finetuned from model: benchang1110/Qwen2.5-Taiwan-3B-Reason-SFT

Training Details

| Hyperparameter | Value |
|---|---|
| learning_rate | 2e-6 |
| num_generations | 8 |
| num_epochs | 1 |
| warmup_ratio | 0.07 |
| lr_scheduler | cosine |
| max_seq_length | 2048 |
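To make two of these settings concrete, here is a rough sketch of what `num_generations` and the warmup/cosine schedule typically mean in GRPO-style training. It is not the author's training code. The binary rewards, the population standard deviation, the `eps` term, and the linear-warmup-then-cosine-to-zero shape are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions sampled for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def lr_at(step, total_steps, base_lr=2e-6, warmup_ratio=0.07):
    """Assumed schedule: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical rewards for one prompt with num_generations = 8
# (1 = verified correct answer, 0 = incorrect).
rewards = [1, 1, 0, 1, 0, 0, 1, 1]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

total_steps = 1000  # hypothetical number of optimizer steps
print(lr_at(0, total_steps), lr_at(total_steps, total_steps))
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, so the update favors answers that beat the group average. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.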

Uses

This model can be used to answer math problems. The <think> tag is already included in the chat template.

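**USER**: Eliza earns $10 per hour for the first 40 hours she works each week. Her overtime pay is 1.2 times her regular hourly wage. If Eliza worked 45 hours this week, how much did she earn this week?
**ASSISTANT**:
Hmm, okay, I need to solve this problem now. Eliza's hourly wage is $10; the first 40 hours are paid at this rate, and anything beyond that is overtime, paid at 1.2 times the regular hourly wage. She worked 45 hours this week, so what is her total income?

First, I need to be clear about the calculation. The first 40 hours are regular time at $10 per hour. The remaining 5 hours are overtime, and overtime pay is 1.2 times the regular rate, so $10 times 1.2 equals $12 per hour, right?

In that case, her total pay should be the pay for the first 40 hours plus the pay for the 5 overtime hours. So first compute the 40-hour part: 40 times 10 equals $400. The overtime is 5 hours at $12 per hour, so 5 times 12 equals $60. Adding the two parts, 400 plus 60, gives $460.

Still, I should check carefully whether I made a mistake anywhere. For example, is the overtime duration right? The problem says she worked 45 hours, so the first 40 are regular and the rest is 45-40=5 hours of overtime. That's right, so the calculation is correct.
</think>


Eliza earns $10 per hour for the first 40 hours each week, and the hours beyond that (45 - 40 = 5 hours) are paid at 1.2 times the regular wage.  
- **Regular pay**: 40 hours × $10/hour = $400  
- **Overtime pay**: 5 hours × ($10/hour × 1.2) = 5 × $12 = $60  
- **Total pay**: $400 + $60 = **$460**

Final answer: \boxed{460}<｜end▁of▁sentence｜>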
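Because the chat template opens the `<think>` block, the generated text contains only the closing `</think>` tag, followed by the final answer. A minimal sketch for separating the reasoning from the answer and pulling out the boxed result is shown here. The helper names are hypothetical, the sample string is an abbreviated stand-in for the output above, and the regular expression does not handle nested braces.

```python
import re

END_OF_SENTENCE = "<｜end▁of▁sentence｜>"


def split_reasoning(generated: str):
    """Split generated text into (reasoning, answer) at the first </think> tag."""
    text = generated.replace(END_OF_SENTENCE, "")
    reasoning, sep, answer = text.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    return reasoning.strip(), answer.strip()


def extract_boxed(answer: str):
    """Return the content of the last \\boxed{...} (no nested braces), or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", answer)
    return matches[-1] if matches else None


# Abbreviated stand-in for a model output.
sample = (
    "40 * 10 = 400, 5 * 12 = 60, 400 + 60 = 460.\n"
    "</think>\n\n"
    "Final answer: \\boxed{460}<｜end▁of▁sentence｜>"
)

reasoning, answer = split_reasoning(sample)
print(extract_boxed(answer))
```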

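**USER**: Compose a Tang-style poem about the helplessness and anger of running out of hot water halfway through a shower.
**ASSISTANT**:
Standing within the deep palace, my bath is half done.
Heaven is unkind; the water slowly turns cold.
The silver flask rings hollow; my anxious heart is aflame.
I vent my rage at the pool's edge; when was there ever a warm spring?
I resent this single tear, which falls and scars the heart.
May the Dragon King appear and swiftly free me from my plight.
Alas, he is nowhere to be found; only an empty lament remains.
When life is full of setbacks, how can one dispel sorrow?
I only hope a fresh breeze comes to scatter the vexation in my breast.
When the sun rises, the bath will be done amid laughter and talk.</think>
This poem depicts the helplessness and anger the poet feels when the water is suddenly cut off during a bath. The opening line establishes that the poet is bathing deep in the palace and, halfway through, meets cold water, vividly conveying the impatience and disappointment of waiting for hot water. The line "Heaven is unkind" then adapts an old saying and personifies a natural phenomenon to express displeasure with Heaven. Through the plaintive ringing of the silver flask, the poet directly conveys inner anger and grief. The last four lines use words such as "resent" and "scatter" to further intensify the strong dissatisfaction with the situation, while also expressing a plea for help from the Dragon King (a deity). Yet harsh reality leaves the poet only able to sigh in vain, finally looking to the hope brought by a fresh breeze; the ending closes on an optimistic note, suggesting that despite the present difficulty there is still confidence in the future. Overall, the poem uses personification, contrast, and other devices to vividly express the outsized emotions stirred by a small setback in daily life.<｜end▁of▁sentence｜>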

Bias, Risks, and Limitations

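This model is intended mainly for answering math problems and is not suited to everyday question answering. In addition, because it was trained only on single-turn conversations, multi-turn conversations may produce unexpected output.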

How to Get Started with the Model

The following code runs single-turn conversations.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before torch initializes CUDA
import torch
import transformers
from transformers import TextStreamer

MODEL_ID = "benchang1110/Qwen2.5-Taiwan-3B-Reason-GRPO"

def generate_response_fp(device):
    tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
    model = transformers.AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map=device,
        # FlashAttention 2 needs the flash-attn package and a GPU; fall back to SDPA on CPU.
        attn_implementation="flash_attention_2" if device == "cuda" else "sdpa",
    )

    # Stream tokens to stdout as they are generated, without echoing the prompt.
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    while True:
        prompt = input('USER:')
        if prompt == "exit":
            break
        print("Assistant: ")
        # Each turn is sent on its own; no conversation history is kept.
        message = [
            {'content': prompt, 'role': 'user'},
        ]

        tokenized_chat = tokenizer.apply_chat_template(message, tokenize=True, add_generation_prompt=True, return_tensors='pt').to(device)
        _ = model.generate(tokenized_chat, use_cache=True, max_new_tokens=2048, streamer=streamer)

if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    generate_response_fp(device)

Evaluation

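We translated common math benchmarks into Traditional Chinese, or converted them from Simplified to Traditional Chinese with opencc, before running the evaluation.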

| Model | MATH-500 | GSM8K | CMATH |
|---|---|---|---|
| SFT | 28.4 | 61.5 | 73.4 |
| GRPO | 31.0 | 68.8 | 75.4 |
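The per-benchmark gain of the GRPO model over the SFT model can be computed directly from the reported scores. The short sketch below only restates those numbers. The unit is not given with the scores, so the differences are reported as plain points.

```python
# Scores copied from the evaluation results above.
scores = {
    "SFT": {"MATH-500": 28.4, "GSM8K": 61.5, "CMATH": 73.4},
    "GRPO": {"MATH-500": 31.0, "GSM8K": 68.8, "CMATH": 75.4},
}

# Difference GRPO - SFT per benchmark, rounded to one decimal place.
gains = {
    name: round(scores["GRPO"][name] - scores["SFT"][name], 1)
    for name in scores["SFT"]
}
print(gains)  # {'MATH-500': 2.6, 'GSM8K': 7.3, 'CMATH': 2.0}
```

The largest gain is on GSM8K (+7.3), with smaller gains on MATH-500 (+2.6) and CMATH (+2.0).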

Citation

@misc{qwen2.5,
    title = {Qwen2.5: A Party of Foundation Models},
    url = {https://qwenlm.github.io/blog/qwen2.5/},
    author = {Qwen Team},
    month = {September},
    year = {2024}
}

@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
      title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning}, 
      author={DeepSeek-AI},
      year={2025},
      eprint={2501.12948},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.12948}, 
}

@misc{Chinese-Data-Distill-From-R1,
  author = {Cong Liu and Zhong Wang and ShengYu Shen and Jialiang Peng and Xiaoli Zhang and ZhenDong Du and YaFang Wang},
  title = {The Chinese dataset distilled from DeepSeek-R1-671b},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/datasets/Congliu/Chinese-DeepSeek-R1-Distill-data-110k}},
}

@misc{albalak2025bigmathlargescalehighqualitymath,
      title={Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models}, 
      author={Alon Albalak and Duy Phung and Nathan Lile and Rafael Rafailov and Kanishk Gandhi and Louis Castricato and Anikait Singh and Chase Blagden and Violet Xiang and Dakota Mahan and Nick Haber},
      year={2025},
      eprint={2502.17387},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.17387}, 
}
