arianUniverse committed
Commit 6363533 · verified · 1 Parent(s): 19d1034

Update README.md

Files changed (1)
  1. README.md +109 -6
README.md CHANGED
@@ -1,7 +1,110 @@
- ---
- license: mit
- datasets:
- - openai/gsm8k
- base_model:
- - Qwen/Qwen2.5-3B-Instruct
+ ---
+ license: apache-2.0
+ datasets:
+ - openai/gsm8k
+ base_model:
+ - Qwen/Qwen2.5-3B-Instruct
+ pipeline_tag: text-generation
+ ---
+
+ # Sir-Thinksalot
+
+ **Sir-Thinksalot** is a language model fine-tuned from Qwen2.5-3B-Instruct and optimized with reinforcement learning to encourage well-thought-out responses. Although the training process incorporated instructions to follow a detailed XML format, in practice the model outputs only the final answer.
+
+ ---
+
+ ## Model Description
+
+ Sir-Thinksalot was designed to output responses in a structured XML format:
+
+ ```xml
+ <reasoning>
+ [Detailed reasoning...]
+ </reasoning>
+ <answer>
+ [Final answer...]
+ </answer>
+ ```
+
+ **Note:** Despite these training instructions, the deployed model currently outputs only the final answer (e.g., "Paris" for the question "What is the capital of France?") without the XML tags or the reasoning section.
+
+ ---
+
+ ## Training Summary
+
+ The model was trained using a reinforcement learning framework based on Unsloth's GRPO (Group Relative Policy Optimization) with an integrated REINFORCE baseline. Key elements of the training include:
+
+ - **Base Model:** Training began with the pre-trained Qwen2.5-3B-Instruct model.
+ - **GRPO with REINFORCE Baseline:** Training uses Unsloth's GRPO algorithm combined with a REINFORCE baseline: the baseline subtracts the mean reward from each individual reward, reducing variance and stabilizing training (sketched below, after the summary paragraph).
+ - **Multiple Reward Functions:** A series of reward functions was defined to:
+   - Encourage correctness of the final answer.
+   - Reward adherence to the specified XML format (even though the deployed model now outputs only the final answer).
+   - Check additional aspects of the output, such as whether the extracted answer is a plain number.
+ - **LoRA Fine-Tuning:** Low-Rank Adaptation (LoRA) was applied to specific target modules (e.g., `q_proj`, `k_proj`, `v_proj`, `o_proj`) to fine-tune the model efficiently (a configuration sketch follows this list).
+ - **Custom Dataset:** A variant of the GSM8K dataset was used, with prompts modified to include instructions for structured responses.
+ - **Monitoring:** The training process was logged with Weights & Biases (wandb) to track reward metrics and model performance.
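+
+ For reference, here is a minimal sketch of what the LoRA setup described above can look like with the `peft` library; the rank, alpha, and dropout values are illustrative assumptions rather than the exact values used for this model:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ # Illustrative hyperparameters -- the exact rank/alpha used for Sir-Thinksalot are not documented here.
+ lora_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.0,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     task_type="CAUSAL_LM",
+ )
+
+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
+ model = get_peft_model(base, lora_config)
+ model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
+ ```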
+
+ This combination of techniques, especially the use of a REINFORCE baseline within GRPO, has been instrumental in guiding the model toward producing more correct and reliable final answers.
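+
+ As a minimal illustration of how such reward functions and the mean-reward baseline fit together, consider the sketch below; the function names, answer extraction, and reward values are simplified assumptions, not the exact functions used in training:
+
+ ```python
+ import re
+
+ def extract_answer(text: str) -> str:
+     """Return the contents of <answer>...</answer>, falling back to the raw text."""
+     match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
+     return (match.group(1) if match else text).strip()
+
+ def correctness_reward(completion: str, gold: str) -> float:
+     # Encourage a final answer that matches the reference answer.
+     return 2.0 if extract_answer(completion) == gold.strip() else 0.0
+
+ def format_reward(completion: str) -> float:
+     # Reward adherence to the <reasoning>/<answer> structure.
+     pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
+     return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0
+
+ def advantages(rewards: list[float]) -> list[float]:
+     # REINFORCE-style baseline: subtract the group's mean reward so that
+     # above-average completions receive a positive learning signal.
+     baseline = sum(rewards) / len(rewards)
+     return [r - baseline for r in rewards]
+
+ # Example: total rewards for four sampled completions of one prompt
+ print(advantages([2.5, 0.5, 2.0, 0.0]))  # [1.25, -0.75, 0.75, -1.25]
+ ```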
+
+ ---
+
+ ## How to Use
+
+ Sir-Thinksalot is available on the Hugging Face Hub and can be integrated into your projects with ease.
+
+ ### Installation
+
+ Ensure you have the necessary libraries installed:
+
+ ```bash
+ pip install huggingface_hub transformers torch
+ ```
+
+ ### Loading the Model
+
+ You can load the model and its tokenizer using the Hugging Face Transformers library:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the tokenizer and model from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("arianUniverse/Sir-Thinksalot")
+ model = AutoModelForCausalLM.from_pretrained("arianUniverse/Sir-Thinksalot")
+
+ # Example prompt
+ prompt = "What is the capital of France?"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ output = model.generate(**inputs, max_new_tokens=50)
+
+ # Decode only the newly generated tokens so the prompt is not echoed back
+ response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
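+
+ Because the base model is an instruction-tuned chat model and training used a system prompt, you may get more consistent behavior by formatting the input with the tokenizer's chat template. Continuing from the example above (the system prompt text here is a placeholder, not the exact prompt shipped with this repository):
+
+ ```python
+ # Placeholder system prompt -- substitute the one provided in this repository if desired.
+ messages = [
+     {"role": "system", "content": "Think carefully, then give the final answer."},
+     {"role": "user", "content": "What is the capital of France?"},
+ ]
+
+ # Tokenize with the model's chat template and generate
+ chat_inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ )
+ chat_output = model.generate(chat_inputs, max_new_tokens=50)
+ print(tokenizer.decode(chat_output[0][chat_inputs.shape[1]:], skip_special_tokens=True))
+ ```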
+
+ ### Expected Output
+
+ Given the current behavior of the model, if you ask:
+
+ ```
+ What is the capital of France?
+ ```
+
+ You can expect a direct answer such as:
+
+ ```
+ Paris
+ ```
+
+ without any structured `<reasoning>` or `<answer>` tags.
+
+ ---
+
+ ## Additional Information
+
+ - **Training Artifacts:** The model was trained using multiple reward functions, including those that encouraged XML structure, alongside a REINFORCE baseline to stabilize learning. However, the output is presently limited to the final answer.
+ - **System Prompt:** The repository contains a default system prompt used during training. While it provides insight into the intended formatting, the deployed model focuses solely on delivering the answer.
+ - **Post-Processing:** If you require structured output, consider a post-processing step that wraps the final answer in your desired XML format (a minimal sketch follows this list).
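+
+ A minimal sketch of such a post-processing step, assuming you only need to wrap the plain-text answer in the `<reasoning>`/`<answer>` layout shown earlier (the helper name and the empty reasoning block are illustrative):
+
+ ```python
+ def wrap_answer(answer: str, reasoning: str = "") -> str:
+     """Wrap a plain answer (and optional reasoning text) in the XML layout used during training."""
+     return (
+         "<reasoning>\n"
+         f"{reasoning}\n"
+         "</reasoning>\n"
+         "<answer>\n"
+         f"{answer}\n"
+         "</answer>"
+     )
+
+ print(wrap_answer("Paris"))
+ ```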
+
+ Feel free to explore Sir-Thinksalot further and adapt it to meet your specific application requirements!
+
  ---