---
tags:
- model
- fine-tuning
- reinforcement-learning
- qwen
- gsm8k
license: mit
language: en
library_name: transformers
datasets:
- gsm8k
---

# 🚀 Fine-Tuning Qwen2.5-3B-Instruct with GRPO on the GSM8K Dataset

## 🌟 Introduction

This repository presents an optimized fine-tuning workflow for **Qwen2.5-3B-Instruct**, leveraging **GRPO (Group Relative Policy Optimization)** to strengthen its mathematical reasoning on the **GSM8K dataset**. By combining reinforcement learning with custom reward functions, the project improves both the accuracy and the structure of the model's answers.

## 🔍 Overview

The notebook follows a structured approach:

1. **🔧 Installation**: Set up the required libraries (`unsloth`, `vllm`, and `trl`) for fast fine-tuning and inference.
2. **⚡ Unsloth Setup**: Patch TRL with **Unsloth's PatchFastRL** and attach **LoRA (Low-Rank Adaptation)** adapters for memory-efficient tuning.
3. **📊 Data Preparation**: Preprocess the **GSM8K dataset** with a **system prompt** and an **XML-structured reasoning and answer format**.
4. **🏆 Reward Functions**: Implement multiple scoring mechanisms to refine model outputs:
   - ✅ **Correctness Reward**: Checks whether the predicted answer matches the ground truth.
   - 📑 **Format Reward**: Ensures compliance with the designated XML structure.
   - 🔢 **Integer Reward**: Confirms that the extracted answer is a valid integer.
   - 🏗️ **XML Count Reward**: Scores the completeness of the structured response.
5. **🎯 GRPO Training**: Configure and run **GRPO training** with **vLLM**-backed generation, optimizing responses with reinforcement learning.
6. **📈 Training Progress**: Track key metrics (reward improvements, completion length, and KL divergence) to confirm steady gains.

## ⚡ Key Features

- **🚀 High-Efficiency Fine-Tuning**: Combines **Unsloth** and **LoRA** for fast, memory-efficient training.
- **🛠️ Custom Reward Functions**: Shapes responses with precision-oriented rewards.
- **⚡ vLLM Integration**: Accelerates generation during reward-based optimization.
- **📚 GSM8K Dataset Focus**: Improves accuracy on grade-school math word problems.

## 🔧 Requirements

- **Python 3.11**
- Required libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`

## 🛠️ Installation

Set up your environment with:

```bash
pip install unsloth vllm trl
```

## 🎬 Usage

1. **Load the Model**: Initialize **Qwen2.5-3B-Instruct** with **LoRA** adapters for fine-tuning.
2. **Prepare the Dataset**: Format the **GSM8K dataset** with a structured **system prompt** and **XML reasoning style**.
3. **Define Reward Functions**: Implement the custom reward functions that guide learning.
4. **Train the Model**: Run the **GRPO trainer** to optimize responses with reinforcement learning.
5. **Monitor Progress**: Watch rewards, response length, and performance trends during training.

A minimal end-to-end sketch of this workflow is shown below.
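The snippet below is an illustrative sketch of the pipeline using TRL's `GRPOTrainer` together with Unsloth and the GSM8K dataset. It shows only one of the four reward functions (correctness); exact argument names and defaults vary between `trl` and `unsloth` releases, and helper names such as `extract_answer`, `gsm8k_gold`, and `correctness_reward` are illustrative rather than taken verbatim from the notebook.

```python
# Minimal sketch, assuming recent trl/unsloth/datasets releases; treat as a
# guide rather than a drop-in script.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # patch TRL's GRPO path before importing it

import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

def extract_answer(text: str) -> str:
    """Pull the contents of the <answer> tag from a completion."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

def gsm8k_gold(answer_field: str) -> str:
    """GSM8K stores the gold answer after the '####' marker."""
    return answer_field.split("####")[-1].strip()

# Build conversational prompts with the system prompt and keep the gold answer.
dataset = load_dataset("openai/gsm8k", "main", split="train").map(
    lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": gsm8k_gold(x["answer"]),
    }
)

def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward 2.0 when the extracted answer matches the gold answer, else 0.0."""
    responses = [c[0]["content"] for c in completions]  # conversational completions
    return [2.0 if extract_answer(r) == a else 0.0 for r, a in zip(responses, answer)]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,    # memory-efficient 4-bit base weights
    fast_inference=True,  # vLLM-backed generation in recent Unsloth releases
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

training_args = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-gsm8k",
    learning_rate=5e-6,
    per_device_train_batch_size=8,  # must be divisible by num_generations
    num_generations=8,              # completions sampled per prompt for the group baseline
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    use_vllm=True,                  # generate rollouts with vLLM
)

trainer = GRPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset,
    reward_funcs=[correctness_reward],  # add format/integer/XML-count rewards here too
)
trainer.train()
```

The remaining rewards (format, integer, and XML count) follow the same pattern: each takes the sampled completions, returns one score per completion, and is appended to `reward_funcs` so GRPO optimizes their combined signal.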
## 📊 Results

Fine-tuning refines the model's ability to generate **precise, well-structured answers** to grade-school math problems. The reward functions drive steady improvement, and training metrics are logged throughout for performance tracking.

## 🔮 Future Work

- **🎯 Hyperparameter Tuning**: Experiment with learning rates, batch sizes, and reward weights for better outcomes.
- **📂 Additional Datasets**: Extend fine-tuning to more datasets for broader generalization.
- **🧠 Advanced Reward Functions**: Develop more sophisticated reward mechanisms to further improve response quality.

## 🙌 Acknowledgments

- **Unsloth**: For pioneering accelerated fine-tuning.
- **vLLM**: For lightning-fast inference.
- **Hugging Face**: For the `trl` library and for hosting the **GSM8K dataset**.
- **Special thanks to @sudhir2016 sir** for invaluable mentorship and guidance.

## 📜 License

This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

---

✨ Happy fine-tuning! Keep pushing AI boundaries! 🚀