Fine-Tuning Qwen2.5-3B-Instruct with GRPO on the GSM8K Dataset
Introduction
This repository presents an optimized fine-tuning workflow for Qwen2.5-3B-Instruct, leveraging GRPO (Group Relative Policy Optimization) to enhance its mathematical reasoning on the GSM8K dataset. By combining reinforcement learning with custom reward functions, the project improves both the accuracy and the structure of the model's step-by-step solutions.
Overview
The notebook follows a structured approach:
- Installation: Set up required libraries such as `unsloth`, `vllm`, and `trl` for high-speed fine-tuning and inference.
- Unsloth Setup: Optimize training with Unsloth's PatchFastRL and integrate LoRA (Low-Rank Adaptation) for memory-efficient tuning.
- Data Preparation: Preprocess the GSM8K dataset with a system prompt and an XML-structured reasoning-and-answer format (see the sketch after this list).
- Reward Functions: Implement multiple scoring mechanisms to refine model outputs:
  - Correctness Reward: Validates whether the predicted answer matches the ground truth.
  - Format Reward: Ensures compliance with the designated XML structure.
  - Integer Reward: Confirms that the extracted answer is a valid integer.
  - XML Count Reward: Evaluates the completeness of the structured response.
- GRPO Training: Configure and execute GRPO training with vLLM, optimizing responses with reinforcement learning.
- Training Progress: Track essential metrics (reward improvements, completion length, and KL divergence) to ensure steady performance gains.
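The snippet below is a minimal sketch of the data-preparation step and one reward function, assuming the `openai/gsm8k` dataset on Hugging Face and the `<reasoning>`/`<answer>` tag convention described above; the helper names (`extract_xml_answer`, `get_gsm8k`, `correctness_reward`), the system prompt text, and the reward values are illustrative, not copied from the notebook.

```python
# Minimal sketch: GSM8K preprocessing and a correctness reward.
# Names, prompt text, and reward values are illustrative placeholders.
import re
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_xml_answer(text: str) -> str:
    """Pull the final answer out of the <answer>...</answer> block."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def get_gsm8k(split: str = "train"):
    """Attach the system prompt and extract the '#### <number>' ground truth."""
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": x["answer"].split("####")[-1].strip(),
    })

def correctness_reward(prompts, completions, answer, **kwargs):
    """Score 2.0 when the extracted answer matches the ground truth, else 0.0."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
```

The format, integer, and XML-count rewards follow the same signature, each returning one score per sampled completion.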
Key Features
- High-Efficiency Fine-Tuning: Combines Unsloth and LoRA for fast, memory-efficient training.
- Custom Reward Functions: Fine-tunes responses using precision-oriented rewards.
- vLLM Integration: Accelerates inference and reward-based optimization.
- GSM8K Dataset Focus: Enhances problem-solving accuracy in mathematical reasoning tasks.
Requirements
- Python 3.11
- Required Libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`
Installation
Set up your environment with:
pip install unsloth vllm trl
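As a rough sketch of the Unsloth setup step, PatchFastRL is applied before TRL is imported so that GRPO training picks up Unsloth's optimizations; the hyperparameters below (sequence length, LoRA rank, GPU memory fraction) are placeholders, not the notebook's exact values.

```python
# Sketch of the Unsloth + LoRA setup; hyperparameters are placeholders.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # patch the GRPO path before importing trl

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,           # 4-bit weights to fit on a single GPU
    fast_inference=True,         # enable vLLM-backed generation
    max_lora_rank=16,
    gpu_memory_utilization=0.6,  # leave headroom for training buffers
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                        # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```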
Usage
- Load the Model: Initialize Qwen2.5-3B-Instruct with LoRA for fine-tuning.
- Prepare the Dataset: Format the GSM8K dataset using a structured system prompt and XML reasoning style.
- Define Reward Functions: Implement custom reward functions for guided learning.
- Train the Model: Run the GRPO trainer to optimize responses using reinforcement learning (a minimal training sketch follows this list).
- Monitor Progress: Analyze rewards, response length, and performance trends in real-time.
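The following is a minimal training sketch, assuming the model, tokenizer, dataset helper, and reward function from the sketches above; every hyperparameter shown is a placeholder rather than the notebook's configuration.

```python
# Sketch of GRPO training with TRL and vLLM rollouts; values are placeholders.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_vllm=True,               # sample rollouts with vLLM for speed
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=6,           # completions per prompt (the GRPO "group")
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    logging_steps=1,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],  # add the format/integer/XML-count rewards here
    args=training_args,
    train_dataset=get_gsm8k("train"),
)
trainer.train()
```

During training, the per-step logs report mean reward, completion length, and the KL term, which correspond to the metrics listed under Training Progress.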
Results
The fine-tuning process improves the model's ability to generate precise, well-structured answers to grade-school math problems. The combined reward signals drive steady improvement, with training metrics logged for performance tracking.
Future Work
- Hyperparameter Tuning: Experimenting with learning rates, batch sizes, and reward weights for better outcomes.
- Additional Datasets: Expanding fine-tuning to diverse datasets for broader generalization.
- Advanced Reward Functions: Developing sophisticated reward mechanisms to further enhance response quality.
Acknowledgments
- Unsloth: For pioneering accelerated fine-tuning.
- vLLM: For lightning-fast inference.
- Hugging Face: For providing the `trl` library and the GSM8K dataset.
- Special thanks to @sudhir2016 for invaluable mentorship and guidance.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Happy fine-tuning! Keep pushing AI boundaries!