---
tags:
- model
- fine-tuning
- reinforcement-learning
- qwen
- gsm8k
license: mit
language: en
library_name: transformers
datasets:
- gsm8k
---

# 🚀 Fine-Tuning Qwen2.5-3B-Instruct with GRPO on the GSM8K Dataset

## 🌟 Introduction

This repository presents an optimized fine-tuning workflow for **Qwen2.5-3B-Instruct**, leveraging **GRPO (Group Relative Policy Optimization)** to strengthen its mathematical reasoning on the **GSM8K dataset**. By combining reinforcement learning with custom reward functions, the project improves both the accuracy and the structure of the model's answers.

## 🔍 Overview

The notebook follows a structured approach:

1. **🔧 Installation**: Set up the required libraries (`unsloth`, `vllm`, and `trl`) for fast fine-tuning and inference.
2. **⚡ Unsloth Setup**: Patch TRL with **Unsloth's PatchFastRL** and attach **LoRA (Low-Rank Adaptation)** adapters for memory-efficient tuning.
3. **📊 Data Preparation**: Preprocess the **GSM8K dataset** with a **system prompt** and an **XML-structured reasoning and answer format**.
4. **🏆 Reward Functions**: Implement multiple scoring mechanisms to refine model outputs:
   - ✅ **Correctness Reward**: Checks whether the predicted answer matches the ground truth.
   - 📑 **Format Reward**: Ensures compliance with the designated XML structure.
   - 🔢 **Integer Reward**: Confirms that the extracted answer is a valid integer.
   - 🏗️ **XML Count Reward**: Scores the completeness of the structured response.
5. **🎯 GRPO Training**: Configure and run **GRPO training** with **vLLM**-backed generation, optimizing responses with reinforcement learning.
6. **📈 Training Progress**: Track key metrics (reward improvements, completion length, and KL divergence) to confirm steady gains.

## ⚡ Key Features

- **🚀 High-Efficiency Fine-Tuning**: Combines **Unsloth** and **LoRA** for fast, memory-efficient training.
- **🛠️ Custom Reward Functions**: Shapes responses with precision-oriented rewards.
- **⚡ vLLM Integration**: Accelerates generation during reward-based optimization.
- **📚 GSM8K Dataset Focus**: Improves accuracy on grade-school math word problems.

## 🔧 Requirements

- **Python 3.11**
- Required libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`

## 🛠️ Installation

Set up your environment with:

```bash
pip install unsloth vllm trl
```

## 🎬 Usage

1. **Load the Model**: Initialize **Qwen2.5-3B-Instruct** with **LoRA** adapters for fine-tuning.
2. **Prepare the Dataset**: Format the **GSM8K dataset** with a structured **system prompt** and **XML reasoning style**.
3. **Define Reward Functions**: Implement the custom reward functions that guide learning.
4. **Train the Model**: Run the **GRPO trainer** to optimize responses with reinforcement learning.
5. **Monitor Progress**: Watch rewards, response length, and performance trends during training.

A minimal end-to-end sketch of this workflow is shown below.
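The snippet below is an illustrative sketch of the pipeline using TRL's `GRPOTrainer` together with Unsloth and the GSM8K dataset. It shows only one of the four reward functions (correctness); exact argument names and defaults vary between `trl` and `unsloth` releases, and helper names such as `extract_answer`, `gsm8k_gold`, and `correctness_reward` are illustrative rather than taken verbatim from the notebook.

```python
# Minimal sketch, assuming recent trl/unsloth/datasets releases; treat as a
# guide rather than a drop-in script.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # patch TRL's GRPO path before importing it

import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

def extract_answer(text: str) -> str:
    """Pull the contents of the <answer> tag from a completion."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def gsm8k_gold(answer_field: str) -> str:
    """GSM8K stores the gold answer after the '####' marker."""
    return answer_field.split("####")[-1].strip()

# Build conversational prompts with the system prompt and keep the gold answer.
dataset = load_dataset("openai/gsm8k", "main", split="train").map(
    lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": gsm8k_gold(x["answer"]),
    }
)

def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward 2.0 when the extracted answer matches the gold answer, else 0.0."""
    responses = [c[0]["content"] for c in completions]  # conversational completions
    return [2.0 if extract_answer(r) == a else 0.0 for r, a in zip(responses, answer)]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,    # memory-efficient 4-bit base weights
    fast_inference=True,  # vLLM-backed generation in recent Unsloth releases
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-gsm8k",
    learning_rate=5e-6,
    per_device_train_batch_size=8,  # must be divisible by num_generations
    num_generations=8,              # completions sampled per prompt for the group baseline
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    use_vllm=True,                  # generate rollouts with vLLM
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
    reward_funcs=[correctness_reward],  # add format/integer/XML-count rewards here too
)
trainer.train()
```

The remaining rewards (format, integer, and XML count) follow the same pattern: each takes the sampled completions, returns one score per completion, and is appended to `reward_funcs` so GRPO optimizes their combined signal.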
## 📊 Results

Fine-tuning refines the model's ability to generate **precise, well-structured answers** to grade-school math problems. The reward functions drive steady improvement, and training metrics are logged throughout for performance tracking.

## 🔮 Future Work

- **🎯 Hyperparameter Tuning**: Experiment with learning rates, batch sizes, and reward weights for better outcomes.
- **📂 Additional Datasets**: Extend fine-tuning to more datasets for broader generalization.
- **🧠 Advanced Reward Functions**: Develop more sophisticated reward mechanisms to further improve response quality.

## 🙌 Acknowledgments

- **Unsloth**: For pioneering accelerated fine-tuning.
- **vLLM**: For lightning-fast inference.
- **Hugging Face**: For the `trl` library and for hosting the **GSM8K dataset**.
- **Special thanks to @sudhir2016 sir** for invaluable mentorship and guidance.

## 📜 License

This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

---

✨ Happy fine-tuning! Keep pushing AI boundaries! 🚀