🚀 Fine-Tuning Qwen2.5-3B-Instruct with GRPO on the GSM8K Dataset

🌟 Introduction

This repository presents an optimized fine-tuning workflow for Qwen2.5-3B-Instruct, leveraging GRPO (Group Relative Policy Optimization) to enhance its mathematical reasoning on the GSM8K dataset. By combining reinforcement learning with custom reward functions, the project improves the model's step-by-step problem solving.

๐Ÿ” Overview

The notebook follows a structured approach:

  1. 🔧 Installation: Set up required libraries such as unsloth, vllm, and trl for high-speed fine-tuning and inference.
  2. ⚡ Unsloth Setup: Optimize training with Unsloth's PatchFastRL, and integrate LoRA (Low-Rank Adaptation) for memory-efficient tuning.
  3. 📊 Data Preparation: Preprocess the GSM8K dataset with a system prompt and an XML-structured reasoning-and-answer format (a minimal sketch follows this list).
  4. 🏆 Reward Functions: Implement multiple scoring mechanisms to refine model outputs (see the second sketch after this list):
    • ✅ Correctness Reward: Validates whether the predicted answer matches the ground truth.
    • 📑 Format Reward: Ensures compliance with the designated XML structure.
    • 🔢 Integer Reward: Confirms that the extracted answer is a valid integer.
    • 🏗️ XML Count Reward: Evaluates the completeness of the structured response.
  5. 🎯 GRPO Training: Configure and execute GRPO training with vLLM, optimizing responses with reinforcement learning.
  6. 📈 Training Progress: Track essential metrics (reward improvements, completion length, and KL divergence) to ensure steady performance gains.
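
The data-preparation step can be sketched as follows. This is a minimal, illustrative version: the exact system-prompt wording and the helper names (SYSTEM_PROMPT, extract_hash_answer, to_chat) are assumptions, not necessarily what the notebook uses.

```python
# Sketch only: prepare GSM8K with a system prompt and an XML answer format.
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_hash_answer(text: str) -> str | None:
    # GSM8K ground-truth answers end with "#### <number>"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

def to_chat(example):
    # Turn each GSM8K row into a chat-style prompt plus a clean reference answer
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": extract_hash_answer(example["answer"]),
    }

dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_chat)
```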
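The reward functions follow TRL's reward-function interface (lists of prompts and completions in, a list of scores out). Below is a hedged sketch of the correctness and format rewards only; the score values (2.0 and 0.5), the regex, and the function names are assumptions.

```python
import re

def extract_xml_answer(text: str) -> str:
    # Pull the text between <answer> ... </answer>
    return text.split("<answer>")[-1].split("</answer>")[0].strip()

def correctness_reward_func(prompts, completions, answer, **kwargs):
    # Reward when the extracted answer matches the ground truth, else 0.0
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

def strict_format_reward_func(completions, **kwargs):
    # Small bonus when the completion follows the <reasoning>/<answer> layout
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.match(pattern, r, re.DOTALL) else 0.0 for r in responses]
```

The integer and XML-count rewards described above follow the same pattern: inspect each completion and return a per-sample score.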

⚡ Key Features

  • 🚀 High-Efficiency Fine-Tuning: Combines Unsloth and LoRA for fast, memory-efficient training (see the loading sketch below).
  • 🛠️ Custom Reward Functions: Fine-tunes responses using precision-oriented rewards.
  • ⚡ vLLM Integration: Accelerates inference and reward-based optimization.
  • 📚 GSM8K Dataset Focus: Enhances problem-solving accuracy in mathematical reasoning tasks.
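
A rough sketch of the Unsloth + LoRA setup is shown below. The rank, sequence length, and target modules are placeholder assumptions rather than the notebook's exact values.

```python
# Illustrative only: patch TRL's GRPO path for Unsloth speed-ups, then load the
# base model in 4-bit and attach LoRA adapters.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # call before building the trainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,        # 4-bit quantization for memory efficiency
    fast_inference=True,      # enable the vLLM backend for generation
    max_lora_rank=64,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,                     # LoRA rank (assumed value)
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```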

🔧 Requirements

  • Python 3.11
  • Required Libraries: unsloth, vllm, trl, torch, transformers

๐Ÿ› ๏ธ Installation

Set up your environment with:

pip install unsloth vllm trl

🎬 Usage

  1. Load the Model: Initialize Qwen2.5-3B-Instruct with LoRA for fine-tuning.
  2. Prepare the Dataset: Format the GSM8K dataset using a structured system prompt and XML reasoning style.
  3. Define Reward Functions: Implement custom reward functions for guided learning.
  4. Train the Model: Run the GRPO trainer to optimize responses using reinforcement learning (a minimal configuration sketch follows this list).
  5. Monitor Progress: Analyze rewards, response length, and performance trends in real time.
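
A minimal GRPO training sketch using trl's GRPOConfig and GRPOTrainer is shown below. It reuses the model, tokenizer, dataset, and reward functions from the earlier sketches, and the hyperparameters (learning rate, generations per prompt, step count) are placeholders, not tuned values from this repository.

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="qwen2.5-3b-grpo-gsm8k",
    learning_rate=5e-6,
    per_device_train_batch_size=8,   # must be divisible by num_generations
    num_generations=8,               # completions sampled per prompt for the group baseline
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    logging_steps=1,
    use_vllm=True,                   # generate rollouts with vLLM for speed
)

trainer = GRPOTrainer(
    model=model,                     # LoRA-wrapped model from the loading sketch
    processing_class=tokenizer,
    reward_funcs=[correctness_reward_func, strict_format_reward_func],
    args=training_args,
    train_dataset=dataset,           # preprocessed GSM8K from the data sketch
)
trainer.train()
```

During training, the logs report per-reward scores, completion length, and KL divergence, which correspond to the metrics tracked in the notebook.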

📊 Results

The fine-tuning process sharpens the model's ability to generate precise, structured answers to mathematical problems. The reward functions drive consistent improvement, and training metrics are logged for performance tracking.

🔮 Future Work

  • 🎯 Hyperparameter Tuning: Experimenting with learning rates, batch sizes, and reward weights for better outcomes.
  • 📂 Additional Datasets: Expanding fine-tuning to diverse datasets for broader generalization.
  • 🧠 Advanced Reward Functions: Developing sophisticated reward mechanisms to further enhance response quality.

🙌 Acknowledgments

  • Unsloth: For pioneering accelerated fine-tuning.
  • vLLM: For lightning-fast inference.
  • Hugging Face: For providing the trl library and GSM8K dataset.
  • Special thanks to @sudhir2016 sir for invaluable mentorship and guidance.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.


✨ Happy fine-tuning! Keep pushing AI boundaries! 🚀
