# Fine-Tuning Qwen2.5-3B-Instruct with GRPO for the GSM8K Dataset

## Introduction

This repository presents an optimized fine-tuning workflow for **Qwen2.5-3B-Instruct**, leveraging **GRPO (Group Relative Policy Optimization)** to enhance its mathematical reasoning on the **GSM8K** dataset. By combining reinforcement learning with custom reward functions, the project improves the model's accuracy and answer structure on grade-school math word problems.

## Overview

The notebook follows a structured approach:

1. **Installation**: Set up the required libraries (`unsloth`, `vllm`, and `trl`) for high-speed fine-tuning and inference.
2. **Unsloth Setup**: Optimize training with **Unsloth's PatchFastRL** and integrate **LoRA (Low-Rank Adaptation)** for memory-efficient tuning.
3. **Data Preparation**: Preprocess the **GSM8K** dataset with a **system prompt** and an **XML-structured reasoning and answer format**.
4. **Reward Functions**: Implement multiple scoring mechanisms to refine model outputs (a sketch follows this list):
   - **Correctness Reward**: Validates whether the predicted answer matches the ground truth.
   - **Format Reward**: Ensures compliance with the designated XML structure.
   - **Integer Reward**: Confirms that the extracted answer is a valid integer.
   - **XML Count Reward**: Evaluates the completeness of the structured response.
5. **GRPO Training**: Configure and execute **GRPO training** with **vLLM**-accelerated generation.
6. **Training Progress**: Track essential metrics (reward improvements, completion length, and KL divergence) to ensure steady performance gains.
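
For concreteness, here is a minimal sketch of what reward functions in this style might look like. It follows the `trl` GRPOTrainer convention (each function returns one score per completion) and assumes plain-string completions, `<reasoning>`/`<answer>` tags, and a dataset column named `answer`; the exact tag names, score values, and signatures in the notebook may differ.

```python
import re


def extract_answer(text: str) -> str:
    """Return the text between <answer> tags, or '' if the tags are missing."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""


def correctness_reward(completions, answer, **kwargs):
    """2.0 when the extracted answer equals the reference answer, else 0.0."""
    return [2.0 if extract_answer(c) == a.strip() else 0.0
            for c, a in zip(completions, answer)]


def format_reward(completions, **kwargs):
    """0.5 when the completion follows the <reasoning>/<answer> layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]


def integer_reward(completions, **kwargs):
    """0.5 when the extracted answer is a plain integer, as GSM8K answers are."""
    return [0.5 if re.fullmatch(r"-?\d+", extract_answer(c)) else 0.0
            for c in completions]


def xml_count_reward(completions, **kwargs):
    """Partial credit (0.125 each) for every expected tag appearing exactly once."""
    tags = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")
    return [sum(0.125 for t in tags if c.count(t) == 1) for c in completions]
```

During training, the trainer combines the scores from all reward functions for each sampled completion, so structure and correctness are optimized together.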

## Key Features

- **High-Efficiency Fine-Tuning**: Combines **Unsloth** and **LoRA** for fast, memory-efficient training.
- **Custom Reward Functions**: Fine-tunes responses using precision-oriented rewards.
- **vLLM Integration**: Accelerates inference and reward-based optimization.
- **GSM8K Dataset Focus**: Enhances problem-solving accuracy on mathematical reasoning tasks.

## Requirements

- **Python 3.11**
- Required libraries: `unsloth`, `vllm`, `trl`, `torch`, `transformers`

## Installation

Set up your environment with:

```bash
pip install unsloth vllm trl
```

## Usage

1. **Load the Model**: Initialize **Qwen2.5-3B-Instruct** with **LoRA** for fine-tuning (see the sketch after this list).
2. **Prepare the Dataset**: Format the **GSM8K** dataset using a structured **system prompt** and **XML reasoning style**.
3. **Define Reward Functions**: Implement the custom reward functions described above for guided learning.
4. **Train the Model**: Run the **GRPO trainer** to optimize responses using reinforcement learning.
5. **Monitor Progress**: Analyze rewards, response length, and performance trends in real time.
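
The sketch below strings these steps together under stated assumptions: it follows the usual Unsloth + `trl` GRPO pattern, reuses the reward functions sketched in the Overview section, and uses placeholder hyperparameters, a placeholder system prompt, and the `openai/gsm8k` Hub dataset. None of these values are guaranteed to match the notebook exactly.

```python
from unsloth import FastLanguageModel, PatchFastRL

PatchFastRL("GRPO", FastLanguageModel)  # patch trl's GRPO path before loading the model

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# 1. Load Qwen2.5-3B-Instruct with vLLM-backed generation and LoRA adapters.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,            # enable the vLLM backend for fast sampling
    gpu_memory_utilization=0.6,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# 2. Prepare GSM8K: prepend a system prompt requesting the XML layout and
#    keep the final numeric answer (the part after '####') as the reference.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>"
)

def to_grpo_example(example):
    return {
        "prompt": SYSTEM_PROMPT + "\n\n" + example["question"],
        "answer": example["answer"].split("####")[-1].strip(),
    }

train_dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_grpo_example)

# 3-4. Configure GRPO and train with the reward functions sketched earlier.
training_args = GRPOConfig(
    output_dir="outputs",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    num_generations=8,              # completions sampled per prompt; must divide the batch size
    max_prompt_length=256,
    max_completion_length=512,
    max_steps=250,
    logging_steps=1,                # step 5: reward and KL appear in the training logs
)
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[format_reward, integer_reward, xml_count_reward, correctness_reward],
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```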

## Results

The fine-tuning process improves the model's ability to generate **precise, well-structured answers** to mathematical problems. The reward functions drive consistent improvement, and training metrics are logged throughout for performance tracking.
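
If you want to inspect those logged metrics after a run, one option is the trainer's standard log history. The snippet below assumes the `trainer` object from the Usage sketch and that the installed `trl` version logs a `reward` entry per step; metric names can vary between versions.

```python
# Pull per-step metrics from the standard transformers Trainer log history.
history = trainer.state.log_history
rewards = [entry["reward"] for entry in history if "reward" in entry]

if rewards:
    print(f"mean reward: first step {rewards[0]:.3f} -> last step {rewards[-1]:.3f}")
```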

## Future Work

- **Hyperparameter Tuning**: Experiment with learning rates, batch sizes, and reward weights for better outcomes.
- **Additional Datasets**: Expand fine-tuning to diverse datasets for broader generalization.
- **Advanced Reward Functions**: Develop more sophisticated reward mechanisms to further enhance response quality.

## Acknowledgments

- **Unsloth**: For the accelerated fine-tuning framework.
- **vLLM**: For fast inference.
- **Hugging Face**: For the `trl` library and for hosting the **GSM8K** dataset.
- **Special thanks to @sudhir2016 sir** for invaluable mentorship and guidance.

## License

This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

---

Happy fine-tuning! Keep pushing AI boundaries!