koolkarni-Atharva10 committed
Commit b3150a4 · verified · 1 Parent(s): 7753b57

Update README.md

Files changed (1)
  1. README.md +50 -46
README.md CHANGED
@@ -1,58 +1,62 @@
- # Fine-Tuning Qwen2.5-3B-Instruct with GRPO for GSM8K Dataset
-
- ## Introduction
- This repository provides a notebook demonstrating the fine-tuning of the **Qwen2.5-3B-Instruct** model using **GRPO (Generalized Reward Policy Optimization)** on the **GSM8K dataset**. The objective is to enhance the model's ability to solve mathematical reasoning problems using reinforcement learning with custom reward functions.
-
- ## Overview
- The notebook is structured as follows:
-
- 1. **Installation**: Installs necessary libraries such as `unsloth`, `vllm`, and `trl` for efficient fine-tuning and inference.
- 2. **Unsloth Setup**: Configures the environment for faster fine-tuning using **Unsloth's PatchFastRL** and loads the **Qwen2.5-3B-Instruct** model with **LoRA** (Low-Rank Adaptation) for parameter-efficient fine-tuning.
- 3. **Data Preparation**: Loads and preprocesses the **GSM8K dataset**, formatting it for training with a system prompt and **XML-style reasoning and answer format**.
- 4. **Reward Functions**: Defines custom reward functions to evaluate the model's responses:
- - **Correctness Reward**: Checks if the extracted answer matches the ground truth.
- - **Format Reward**: Ensures the response follows the specified XML format.
- - **Integer Reward**: Verifies if the extracted answer is an integer.
- - **XML Count Reward**: Evaluates the completeness of the XML structure in the response.
- 5. **GRPO Training**: Configures and runs the **GRPO trainer** with **vLLM** for fast inference, using the defined reward functions to optimize performance.
- 6. **Training Progress**: Monitors progress, including rewards, completion length, and KL divergence to ensure improvements over time.
-
- ## Key Features
- - **Efficient Fine-Tuning**: Utilizes **Unsloth** and **LoRA** to fine-tune the model with reduced memory usage and faster training times.
- - **Custom Reward Functions**: Implements multiple reward functions to guide the model toward correct and well-formatted responses.
- - **vLLM Integration**: Uses **vLLM** for fast inference, enabling efficient generation of multiple responses for reward calculation.
- - **GSM8K Dataset**: Focuses on enhancing the model's performance in **mathematical reasoning tasks**.
-
- ## Requirements
  - **Python 3.11**
  - Required Libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`

- ## Installation
- To set up the environment, run:
  ```bash
  pip install unsloth vllm trl
  ```

- ## Usage
- 1. **Load the Model**: The notebook loads the **Qwen2.5-3B-Instruct** model with **LoRA** for fine-tuning.
- 2. **Prepare the Dataset**: The **GSM8K dataset** is loaded and formatted with a **system prompt** and **XML-style reasoning and answer format**.
- 3. **Define Reward Functions**: Custom reward functions are defined to evaluate the model's responses.
- 4. **Train the Model**: The **GRPO trainer** is configured and run to fine-tune the model using the defined reward functions.
- 5. **Monitor Progress**: Training progress is monitored, including **rewards, completion length, and KL divergence**.

- ## Results
- The training process aims to enhance the model's ability to generate **accurate and well-structured responses** to mathematical reasoning problems. The reward functions ensure the model improves over time, and training progress is logged for analysis.

- ## Future Work
- - **Hyperparameter Tuning**: Experimenting with learning rates, batch sizes, and reward weights to optimize performance.
- - **Additional Datasets**: Extending fine-tuning to other datasets for improved generalization.
- - **Advanced Reward Functions**: Implementing more sophisticated reward functions for refined responses.

- ## Acknowledgments
- - **Unsloth**: For providing tools to speed up fine-tuning.
- - **vLLM**: For enabling fast inference during training.
- - **Hugging Face**: For the `trl` library and the **GSM8K dataset**.
- - **Special thanks to @sudhir2016 sir** for mentoring and guiding this project.

- ## License
  This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

+ # 🚀 Fine-Tuning Qwen2.5-3B-Instruct with GRPO for GSM8K Dataset
+
+ ## 🌟 Introduction
+ This repository presents an optimized fine-tuning workflow for **Qwen2.5-3B-Instruct**, leveraging **GRPO (Group Relative Policy Optimization)** to enhance its mathematical reasoning skills on the **GSM8K dataset**. By combining reinforcement learning with custom reward functions, the project aims to improve the model's accuracy on grade-school math word problems.
+
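A one-line summary of the algorithm may help here (added for context; the README itself does not spell this out): GRPO samples a group of G completions for each prompt, scores them with the reward functions, and uses the group's own statistics in place of a learned value function.

```latex
% Group-relative advantage for completion i, given rewards r_1, ..., r_G
% assigned by the reward functions to the G completions of one prompt.
A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}
```

Completions that score above their group mean are reinforced, while the KL divergence tracked during training keeps the updated policy close to the original instruct model.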
+ ## 🔍 Overview
+ The notebook follows a structured approach:
+
+ 1. **🔧 Installation**: Set up required libraries like `unsloth`, `vllm`, and `trl` for high-speed fine-tuning and inference.
+ 2. **⚡ Unsloth Setup**: Optimize training with **Unsloth's PatchFastRL**, and integrate **LoRA (Low-Rank Adaptation)** for memory-efficient tuning.
+ 3. **📊 Data Preparation**: Preprocess the **GSM8K dataset** with a **system prompt** and **XML-structured reasoning and answer format**.
+ 4. **🏆 Reward Functions**: Implement multiple scoring mechanisms to refine model outputs (see the sketch after this list):
+ - ✅ **Correctness Reward**: Validates whether the predicted answer matches the ground truth.
+ - 📑 **Format Reward**: Ensures compliance with the designated XML structure.
+ - 🔢 **Integer Reward**: Confirms that the extracted answer is a valid integer.
+ - 🏗️ **XML Count Reward**: Evaluates the completeness of the structured response.
+ 5. **🎯 GRPO Training**: Configure and execute **GRPO training** with **vLLM**, optimizing responses with reinforcement learning.
+ 6. **📈 Training Progress**: Track essential metrics, including reward improvements, completion length, and KL divergence, to ensure steady performance gains.
+
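The README does not inline the notebook code, so the following is a minimal sketch of how the data preparation and two of the reward functions could look. It assumes TRL's `GRPOTrainer` reward-function convention (`reward_func(prompts, completions, **kwargs)` returning one score per completion) and the `openai/gsm8k` dataset on the Hugging Face Hub; the function names, system prompt wording, and reward magnitudes are illustrative placeholders, not the notebook's exact code.

```python
# Illustrative sketch: GSM8K preprocessing plus the correctness and format
# rewards described above, written for TRL's GRPOTrainer reward interface.
import re
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_xml_answer(text: str) -> str:
    """Pull the contents of the <answer> block out of a completion."""
    answer = text.split("<answer>")[-1].split("</answer>")[0]
    return answer.strip()

def extract_gsm8k_answer(text: str) -> str:
    """GSM8K ground-truth solutions end with '#### <final answer>'."""
    return text.split("####")[-1].strip()

def get_gsm8k(split: str = "train"):
    """Load GSM8K and attach the chat-style prompt and the gold answer."""
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": extract_gsm8k_answer(x["answer"]),
    })

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward 2.0 when the extracted answer matches the ground truth, else 0.0."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

def format_reward_func(completions, **kwargs) -> list[float]:
    """Small reward when the completion follows the <reasoning>/<answer> layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]
```

The integer and XML-count rewards described above would follow the same pattern, each returning one score per sampled completion.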
+ ## ⚡ Key Features
+ - **🚀 High-Efficiency Fine-Tuning**: Combines **Unsloth** and **LoRA** for fast, memory-efficient training.
+ - **🛠️ Custom Reward Functions**: Guides the model toward correct, well-formatted answers with several complementary rewards.
+ - **⚡ vLLM Integration**: Uses **vLLM** for fast generation of the multiple completions scored by the reward functions.
+ - **📚 GSM8K Dataset Focus**: Enhances problem-solving accuracy in mathematical reasoning tasks.
+
+ ## 🔧 Requirements
  - **Python 3.11**
  - Required Libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`

+ ## 🛠️ Installation
+ Set up your environment with:
  ```bash
  pip install unsloth vllm trl
  ```

+ ## 🎬 Usage
+ 1. **Load the Model**: Initialize **Qwen2.5-3B-Instruct** with **LoRA** for fine-tuning.
+ 2. **Prepare the Dataset**: Format the **GSM8K dataset** using a structured **system prompt** and **XML reasoning style**.
+ 3. **Define Reward Functions**: Implement custom reward functions for guided learning.
+ 4. **Train the Model**: Run the **GRPO trainer** to optimize responses using reinforcement learning (a configuration sketch follows this list).
+ 5. **Monitor Progress**: Analyze rewards, response length, and performance trends in real time.
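To make these steps concrete, here is a minimal sketch of how the model loading and trainer configuration could be wired together with Unsloth and TRL. It reuses `get_gsm8k`, `format_reward_func`, and `correctness_reward_func` from the earlier sketch; the hyperparameters are placeholders rather than the notebook's actual settings, and the `PatchFastRL` call follows Unsloth's published GRPO examples.

```python
# Minimal sketch (assumed hyperparameters): Unsloth + LoRA model loading and
# TRL's GRPOTrainer driven by the reward functions defined earlier.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # patch TRL's GRPO path before loading, per Unsloth's examples

from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,    # 4-bit base weights to cut memory use
    fast_inference=True,  # enable Unsloth's vLLM-backed generation
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                 # LoRA rank (placeholder value)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,
    per_device_train_batch_size=8,  # keep divisible by num_generations
    num_generations=8,              # completions sampled per prompt (the "group")
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    logging_steps=1,
    use_vllm=True,                  # generate rollouts with vLLM
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward_func, correctness_reward_func],
    args=training_args,
    train_dataset=get_gsm8k("train"),
)
trainer.train()
```

The per-step rewards, completion lengths, and KL values referred to under "Monitor Progress" are emitted through the trainer's standard logging (for example, `trainer.state.log_history` after training), so basic tracking needs no extra instrumentation.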

+ ## 📊 Results
+ The fine-tuning process refines the model's ability to generate **precise and structured answers** to mathematical problems. The reward functions guide the model toward steady improvement, and training metrics are logged for performance tracking.

+ ## 🔮 Future Work
+ - **🎯 Hyperparameter Tuning**: Experimenting with learning rates, batch sizes, and reward weights for better outcomes.
+ - **📂 Additional Datasets**: Expanding fine-tuning to diverse datasets for broader generalization.
+ - **🧠 Advanced Reward Functions**: Developing more sophisticated reward mechanisms to further enhance response quality.

+ ## 🙌 Acknowledgments
+ - **Unsloth**: For the tooling that speeds up fine-tuning.
+ - **vLLM**: For fast inference during training.
+ - **Hugging Face**: For providing the `trl` library and the **GSM8K dataset**.
+ - **Special thanks to @sudhir2016 sir** for invaluable mentorship and guidance.

+ ## 📜 License
  This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.
+
+ ---
+
+ ✨ Happy fine-tuning! Keep pushing AI boundaries! 🚀