---
license: apache-2.0
language: en
tags:
- text-generation
- causal-lm
- reinforcement-learning
- GRPO
- instruction-tuning
- chain-of-thought
datasets:
- gsm8k
pipeline_tag: text-generation
widget:
- text: "What is 27 plus 16? Let's think step by step."
---

# GRPO: Finetuned Causal Language Model using Group Relative Policy Optimization

## 🧠 Model Description

**GRPO** is a causal language model fine-tuned with **Group Relative Policy Optimization (GRPO)**, a PPO-derived reinforcement learning algorithm that optimizes language models through groupwise reward comparisons. Rather than training a separate value network, GRPO samples several completions per prompt and scores each against the rest of its group, so the model learns from the *relative* ranking of its own outputs (the advantage formula is sketched in the appendix at the end of this card). This makes it well suited to structured generation tasks such as **Chain-of-Thought (CoT)** reasoning.

- **Base model**: `HuggingFaceTB/SmolLM-135M-Instruct`
- **Language**: English
- **License**: Apache 2.0
- **Trained by**: City Sun
- **Training duration**: ~4 hours (RTX 3060 6GB)
- **Dataset**: GSM8K with CoT formatting

---

## ✅ Intended Uses

### Direct Use

- Math and logic word-problem solving
- Educational QA
- Chain-of-Thought-aligned instruction following

### Out-of-Scope Use ❌

- Real-time decision making
- Sensitive domains: healthcare, law, finance

---

## 🛠 Training Details

- **Algorithm**: GRPO
- **Epochs**: 3
- **Batch size**: 4
- **Learning rate**: 1e-5
- **Reward function**: `reward_len` (length-based)
- **KL penalty**: 0.1
- **Max sequence length**: 512
- **Compute**: RTX 3060 6GB

An illustrative training script wiring these settings together is sketched in the appendix at the end of this card.

---

## 📈 W&B Metrics (Summary)

- Final reward: ~0.28
- Final loss: ~0.015
- Total training steps: ~500
- Average generation length: ~64 tokens

---

## 🚀 How to Use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("HuangXinBa/GRPO").to(device)
tokenizer = AutoTokenizer.from_pretrained("HuangXinBa/GRPO")

prompt = "What is 27 plus 16? Let's think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## 📖 Citation

```bibtex
@misc{grpo2025,
  title={GRPO-1: Finetuning a Language Model with Group Relative Policy Optimization},
  author={Huang Jinting},
  year={2025},
  note={https://wandb.ai/ggg7334-the-school-of-the-new-york-times/GRPO}
}
```
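
---

## 📐 Appendix: Group-Relative Advantage

For background, this is the group-relative advantage GRPO optimizes, as introduced in the DeepSeekMath paper. For each prompt, the policy samples \\(G\\) completions, the reward function scores each one, and each completion's advantage is its reward normalized against the group statistics (replacing PPO's learned value baseline):

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
$$

A KL-divergence penalty against the reference (base) model, with the coefficient listed above (0.1), keeps the fine-tuned policy from drifting too far from the base model during training.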
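
---

## 🔁 Appendix: Training Sketch

The exact training script for this model was not published. The sketch below shows how the hyperparameters listed above could map onto TRL's `GRPOTrainer`; the shaping inside `reward_len` (target length, scaling) and the group size `num_generations` are illustrative assumptions, not the settings actually used.

```python
# Minimal GRPO training sketch (illustrative; not the exact script used for this model).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K questions become prompts; the answers are unused because the
# reward below depends only on completion length.
dataset = load_dataset("gsm8k", "main", split="train")
dataset = dataset.map(lambda ex: {"prompt": ex["question"]})

def reward_len(completions, **kwargs):
    # Length-based reward: completions closer to a target length score higher.
    # The target and scaling here are assumptions for illustration.
    target = 64
    return [-abs(target - len(completion)) / target for completion in completions]

config = GRPOConfig(
    output_dir="GRPO",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    beta=0.1,                   # KL penalty coefficient from the card
    max_completion_length=512,  # card's max sequence length, mapped here (assumption)
    num_generations=4,          # completions sampled per prompt (group size); assumed
)

trainer = GRPOTrainer(
    model="HuggingFaceTB/SmolLM-135M-Instruct",
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

Note that `num_generations` must divide the effective batch size, which is why it is set to 4 here to match the card's batch size of 4.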