koolkarni-Atharva10 committed
Commit b3150a4 · verified · 1 Parent(s): 7753b57

Update README.md

Files changed (1)
  1. README.md +50 -46
README.md CHANGED
@@ -1,58 +1,62 @@
- # Fine-Tuning Qwen2.5-3B-Instruct with GRPO for GSM8K Dataset
-
- ## Introduction
- This repository provides a notebook demonstrating the fine-tuning of the **Qwen2.5-3B-Instruct** model using **GRPO (Generalized Reward Policy Optimization)** on the **GSM8K dataset**. The objective is to enhance the model's ability to solve mathematical reasoning problems using reinforcement learning with custom reward functions.
-
- ## Overview
- The notebook is structured as follows:
-
- 1. **Installation**: Installs necessary libraries such as `unsloth`, `vllm`, and `trl` for efficient fine-tuning and inference.
- 2. **Unsloth Setup**: Configures the environment for faster fine-tuning using **Unsloth's PatchFastRL** and loads the **Qwen2.5-3B-Instruct** model with **LoRA** (Low-Rank Adaptation) for parameter-efficient fine-tuning.
- 3. **Data Preparation**: Loads and preprocesses the **GSM8K dataset**, formatting it for training with a system prompt and **XML-style reasoning and answer format**.
- 4. **Reward Functions**: Defines custom reward functions to evaluate the model's responses:
- - **Correctness Reward**: Checks if the extracted answer matches the ground truth.
- - **Format Reward**: Ensures the response follows the specified XML format.
- - **Integer Reward**: Verifies if the extracted answer is an integer.
- - **XML Count Reward**: Evaluates the completeness of the XML structure in the response.
- 5. **GRPO Training**: Configures and runs the **GRPO trainer** with **vLLM** for fast inference, using the defined reward functions to optimize performance.
- 6. **Training Progress**: Monitors progress, including rewards, completion length, and KL divergence to ensure improvements over time.
-
- ## Key Features
- - **Efficient Fine-Tuning**: Utilizes **Unsloth** and **LoRA** to fine-tune the model with reduced memory usage and faster training times.
- - **Custom Reward Functions**: Implements multiple reward functions to guide the model toward correct and well-formatted responses.
- - **vLLM Integration**: Uses **vLLM** for fast inference, enabling efficient generation of multiple responses for reward calculation.
- - **GSM8K Dataset**: Focuses on enhancing the model's performance in **mathematical reasoning tasks**.
-
- ## Requirements
  - **Python 3.11**
  - Required Libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`

- ## Installation
- To set up the environment, run:
  ```bash
  pip install unsloth vllm trl
  ```

- ## Usage
- 1. **Load the Model**: The notebook loads the **Qwen2.5-3B-Instruct** model with **LoRA** for fine-tuning.
- 2. **Prepare the Dataset**: The **GSM8K dataset** is loaded and formatted with a **system prompt** and **XML-style reasoning and answer format**.
- 3. **Define Reward Functions**: Custom reward functions are defined to evaluate the model's responses.
- 4. **Train the Model**: The **GRPO trainer** is configured and run to fine-tune the model using the defined reward functions.
- 5. **Monitor Progress**: Training progress is monitored, including **rewards, completion length, and KL divergence**.

- ## Results
- The training process aims to enhance the model's ability to generate **accurate and well-structured responses** to mathematical reasoning problems. The reward functions ensure the model improves over time, and training progress is logged for analysis.

- ## Future Work
- - **Hyperparameter Tuning**: Experimenting with learning rates, batch sizes, and reward weights to optimize performance.
- - **Additional Datasets**: Extending fine-tuning to other datasets for improved generalization.
- - **Advanced Reward Functions**: Implementing more sophisticated reward functions for refined responses.

- ## Acknowledgments
- - **Unsloth**: For providing tools to speed up fine-tuning.
- - **vLLM**: For enabling fast inference during training.
- - **Hugging Face**: For the `trl` library and the **GSM8K dataset**.
- - **Special thanks to @sudhir2016 sir** for mentoring and guiding this project.

- ## License
  This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

+ # 🚀 Fine-Tuning Qwen2.5-3B-Instruct with GRPO for GSM8K Dataset
+
+ ## 🌟 Introduction
+ This repository presents an optimized fine-tuning workflow for **Qwen2.5-3B-Instruct**, leveraging **GRPO (Group Relative Policy Optimization)** to enhance its mathematical reasoning skills on the **GSM8K dataset**. By combining reinforcement learning with custom reward functions, the project aims to improve the model's accuracy on grade-school math word problems.
+
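A one-line summary of the algorithm may help here (added for context; the README itself does not spell this out): GRPO samples a group of G completions for each prompt, scores them with the reward functions, and uses the group's own statistics in place of a learned value function.

```latex
% Group-relative advantage for completion i, given rewards r_1, ..., r_G
% assigned by the reward functions to the G completions of one prompt.
A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_G)}{\mathrm{std}(r_1, \ldots, r_G)}
```

Completions that score above their group mean are reinforced, while the KL divergence tracked during training keeps the updated policy close to the original instruct model.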
+ ## 🔍 Overview
+ The notebook follows a structured approach:
+
+ 1. **🔧 Installation**: Set up required libraries like `unsloth`, `vllm`, and `trl` for high-speed fine-tuning and inference.
+ 2. **⚡ Unsloth Setup**: Optimize training with **Unsloth's PatchFastRL**, and integrate **LoRA (Low-Rank Adaptation)** for memory-efficient tuning.
+ 3. **📊 Data Preparation**: Preprocess the **GSM8K dataset** with a **system prompt** and **XML-structured reasoning and answer format**.
+ 4. **🏆 Reward Functions**: Implement multiple scoring mechanisms to refine model outputs (see the sketch after this list):
+ - ✅ **Correctness Reward**: Validates whether the predicted answer matches the ground truth.
+ - 📑 **Format Reward**: Ensures compliance with the designated XML structure.
+ - 🔢 **Integer Reward**: Confirms that the extracted answer is a valid integer.
+ - 🏗️ **XML Count Reward**: Evaluates the completeness of the structured response.
+ 5. **🎯 GRPO Training**: Configure and execute **GRPO training** with **vLLM**, optimizing responses with reinforcement learning.
+ 6. **📈 Training Progress**: Track essential metrics, including reward improvements, completion length, and KL divergence, to ensure steady performance gains.
+
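The README does not inline the notebook code, so the following is a minimal sketch of how the data preparation and two of the reward functions could look. It assumes TRL's `GRPOTrainer` reward-function convention (`reward_func(prompts, completions, **kwargs)` returning one score per completion) and the `openai/gsm8k` dataset on the Hugging Face Hub; the function names, system prompt wording, and reward magnitudes are illustrative placeholders, not the notebook's exact code.

```python
# Illustrative sketch: GSM8K preprocessing plus the correctness and format
# rewards described above, written for TRL's GRPOTrainer reward interface.
import re
from datasets import load_dataset

SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>"""

def extract_xml_answer(text: str) -> str:
    """Pull the contents of the <answer> block out of a completion."""
    answer = text.split("<answer>")[-1].split("</answer>")[0]
    return answer.strip()

def extract_gsm8k_answer(text: str) -> str:
    """GSM8K ground-truth solutions end with '#### <final answer>'."""
    return text.split("####")[-1].strip()

def get_gsm8k(split: str = "train"):
    """Load GSM8K and attach the chat-style prompt and the gold answer."""
    data = load_dataset("openai/gsm8k", "main")[split]
    return data.map(lambda x: {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": x["question"]},
        ],
        "answer": extract_gsm8k_answer(x["answer"]),
    })

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Reward 2.0 when the extracted answer matches the ground truth, else 0.0."""
    responses = [completion[0]["content"] for completion in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

def format_reward_func(completions, **kwargs) -> list[float]:
    """Small reward when the completion follows the <reasoning>/<answer> layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]
```

The integer and XML-count rewards described above would follow the same pattern, each returning one score per sampled completion.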
+ ## ⚡ Key Features
+ - **🚀 High-Efficiency Fine-Tuning**: Combines **Unsloth** and **LoRA** for fast, memory-efficient training.
+ - **🛠️ Custom Reward Functions**: Guides the model toward correct, well-formatted answers with several complementary rewards.
+ - **⚡ vLLM Integration**: Uses **vLLM** for fast generation of the multiple completions scored by the reward functions.
+ - **📚 GSM8K Dataset Focus**: Enhances problem-solving accuracy in mathematical reasoning tasks.
+
+ ## 🔧 Requirements
  - **Python 3.11**
  - Required Libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`

+ ## 🛠️ Installation
+ Set up your environment with:
  ```bash
  pip install unsloth vllm trl
  ```

+ ## 🎬 Usage
+ 1. **Load the Model**: Initialize **Qwen2.5-3B-Instruct** with **LoRA** for fine-tuning.
+ 2. **Prepare the Dataset**: Format the **GSM8K dataset** using a structured **system prompt** and **XML reasoning style**.
+ 3. **Define Reward Functions**: Implement custom reward functions for guided learning.
+ 4. **Train the Model**: Run the **GRPO trainer** to optimize responses using reinforcement learning (a configuration sketch follows this list).
+ 5. **Monitor Progress**: Analyze rewards, response length, and performance trends in real time.
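To make these steps concrete, here is a minimal sketch of how the model loading and trainer configuration could be wired together with Unsloth and TRL. It reuses `get_gsm8k`, `format_reward_func`, and `correctness_reward_func` from the earlier sketch; the hyperparameters are placeholders rather than the notebook's actual settings, and the `PatchFastRL` call follows Unsloth's published GRPO examples.

```python
# Minimal sketch (assumed hyperparameters): Unsloth + LoRA model loading and
# TRL's GRPOTrainer driven by the reward functions defined earlier.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)  # patch TRL's GRPO path before loading, per Unsloth's examples

from trl import GRPOConfig, GRPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,    # 4-bit base weights to cut memory use
    fast_inference=True,  # enable Unsloth's vLLM-backed generation
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                 # LoRA rank (placeholder value)
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,
    per_device_train_batch_size=8,  # keep divisible by num_generations
    num_generations=8,              # completions sampled per prompt (the "group")
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    logging_steps=1,
    use_vllm=True,                  # generate rollouts with vLLM
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward_func, correctness_reward_func],
    args=training_args,
    train_dataset=get_gsm8k("train"),
)
trainer.train()
```

The per-step rewards, completion lengths, and KL values referred to under "Monitor Progress" are emitted through the trainer's standard logging (for example, `trainer.state.log_history` after training), so basic tracking needs no extra instrumentation.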

+ ## 📊 Results
+ The fine-tuning process refines the model's ability to generate **precise and structured answers** to mathematical problems. The reward functions guide the model toward steady improvement, and training metrics are logged for performance tracking.

+ ## 🔮 Future Work
+ - **🎯 Hyperparameter Tuning**: Experimenting with learning rates, batch sizes, and reward weights for better outcomes.
+ - **📂 Additional Datasets**: Expanding fine-tuning to diverse datasets for broader generalization.
+ - **🧠 Advanced Reward Functions**: Developing more sophisticated reward mechanisms to further enhance response quality.

+ ## 🙌 Acknowledgments
+ - **Unsloth**: For the tooling that speeds up fine-tuning.
+ - **vLLM**: For fast inference during training.
+ - **Hugging Face**: For providing the `trl` library and the **GSM8K dataset**.
+ - **Special thanks to @sudhir2016 sir** for invaluable mentorship and guidance.

+ ## 📜 License
  This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.
+
+ ---
+
+ ✨ Happy fine-tuning! Keep pushing AI boundaries! 🚀