Fine-Tuning Qwen2.5-3B-Instruct with GRPO on the GSM8K Dataset
Introduction
This repository presents an optimized fine-tuning workflow for Qwen2.5-3B-Instruct, leveraging GRPO (Group Relative Policy Optimization) to enhance its mathematical reasoning on the GSM8K dataset. By combining reinforcement learning with custom reward functions, the project improves both the accuracy and the structure of the model's step-by-step solutions.
Overview
The notebook follows a structured approach:
- Installation: Set up required libraries such as `unsloth`, `vllm`, and `trl` for high-speed fine-tuning and inference.
- Unsloth Setup: Optimize training with Unsloth's PatchFastRL and integrate LoRA (Low-Rank Adaptation) for memory-efficient tuning.
- Data Preparation: Preprocess the GSM8K dataset with a system prompt and an XML-structured reasoning-and-answer format (see the sketch after this list).
- Reward Functions: Implement multiple scoring mechanisms to refine model outputs:
  - Correctness Reward: Validates whether the predicted answer matches the ground truth.
  - Format Reward: Ensures compliance with the designated XML structure.
  - Integer Reward: Confirms that the extracted answer is a valid integer.
  - XML Count Reward: Evaluates the completeness of the structured response.
- GRPO Training: Configure and execute GRPO training with vLLM, optimizing responses with reinforcement learning.
- Training Progress: Track essential metrics (reward improvements, completion length, and KL divergence) to ensure steady performance gains.
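The snippet below is a minimal sketch of the data-preparation step and one reward function, assuming the `openai/gsm8k` dataset on Hugging Face and the `<reasoning>`/`<answer>` tag convention described above; the helper names (`extract_xml_answer`, `get_gsm8k`, `correctness_reward`), the system prompt text, and the reward values are illustrative, not copied from the notebook.

```python
# Minimal sketch: GSM8K preprocessing and a correctness reward.
# Names, prompt text, and reward values are illustrative placeholders.
import re
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_xml_answer(text: str) -> str:
    """Pull the final answer out of the <answer>...</answer> block."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def get_gsm8k(split: str = "train"):
    """Attach the system prompt and extract the '#### <number>' ground truth."""
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": x["answer"].split("####")[-1].strip(),
    })

def correctness_reward(prompts, completions, answer, **kwargs):
    """Score 2.0 when the extracted answer matches the ground truth, else 0.0."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
```

The format, integer, and XML-count rewards follow the same signature, each returning one score per sampled completion.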
Key Features
- High-Efficiency Fine-Tuning: Combines Unsloth and LoRA for fast, memory-efficient training.
- Custom Reward Functions: Fine-tunes responses using precision-oriented rewards.
- vLLM Integration: Accelerates inference and reward-based optimization.
- GSM8K Dataset Focus: Enhances problem-solving accuracy in mathematical reasoning tasks.
Requirements
- Python 3.11
- Required Libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`
Installation
Set up your environment with:
pip install unsloth vllm trl
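As a rough sketch of the Unsloth setup step, PatchFastRL is applied before TRL is imported so that GRPO training picks up Unsloth's optimizations; the hyperparameters below (sequence length, LoRA rank, GPU memory fraction) are placeholders, not the notebook's exact values.

```python
# Sketch of the Unsloth + LoRA setup; hyperparameters are placeholders.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # patch the GRPO path before importing trl

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,           # 4-bit weights to fit on a single GPU
    fast_inference=True,         # enable vLLM-backed generation
    max_lora_rank=16,
    gpu_memory_utilization=0.6,  # leave headroom for training buffers
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                        # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```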
Usage
- Load the Model: Initialize Qwen2.5-3B-Instruct with LoRA for fine-tuning.
- Prepare the Dataset: Format the GSM8K dataset using a structured system prompt and XML reasoning style.
- Define Reward Functions: Implement custom reward functions for guided learning.
- Train the Model: Run the GRPO trainer to optimize responses using reinforcement learning (a minimal training sketch follows this list).
- Monitor Progress: Analyze rewards, response length, and performance trends in real-time.
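The following is a minimal training sketch, assuming the model, tokenizer, dataset helper, and reward function from the sketches above; every hyperparameter shown is a placeholder rather than the notebook's configuration.

```python
# Sketch of GRPO training with TRL and vLLM rollouts; values are placeholders.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    use_vllm=True,               # sample rollouts with vLLM for speed
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=6,           # completions per prompt (the GRPO "group")
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    logging_steps=1,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],  # add the format/integer/XML-count rewards here
    args=training_args,
    train_dataset=get_gsm8k("train"),
)
trainer.train()
```

During training, the per-step logs report mean reward, completion length, and the KL term, which correspond to the metrics listed under Training Progress.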
Results
The fine-tuning process improves the model's ability to generate precise, well-structured answers to grade-school math problems. The combined reward signals drive steady improvement, with training metrics logged for performance tracking.
Future Work
- Hyperparameter Tuning: Experimenting with learning rates, batch sizes, and reward weights for better outcomes.
- Additional Datasets: Expanding fine-tuning to diverse datasets for broader generalization.
- Advanced Reward Functions: Developing sophisticated reward mechanisms to further enhance response quality.
Acknowledgments
- Unsloth: For pioneering accelerated fine-tuning.
- vLLM: For lightning-fast inference.
- Hugging Face: For providing the `trl` library and the GSM8K dataset.
- Special thanks to @sudhir2016 for invaluable mentorship and guidance.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Happy fine-tuning! Keep pushing AI boundaries!