arianUniverse committed
Commit 6363533 · verified · 1 Parent(s): 19d1034

Update README.md

Files changed (1)
  1. README.md +109 -6
README.md CHANGED
@@ -1,7 +1,110 @@
- ---
- license: mit
- datasets:
- - openai/gsm8k
- base_model:
- - Qwen/Qwen2.5-3B-Instruct
+ ---
+ license: apache-2.0
+ datasets:
+ - openai/gsm8k
+ base_model:
+ - Qwen/Qwen2.5-3B-Instruct
+ pipeline_tag: text-generation
+ ---
+
+ # Sir-Thinksalot
+
+ **Sir-Thinksalot** is a language model fine-tuned from Qwen2.5-3B-Instruct and optimized with reinforcement learning to encourage well-thought-out responses. Although the training process incorporated instructions to follow a detailed XML format, in practice the model outputs only the final answer.
+
+ ---
+
+ ## Model Description
+
+ Sir-Thinksalot was designed to output responses in a structured XML format:
+
+ ```xml
+ <reasoning>
+ [Detailed reasoning...]
+ </reasoning>
+ <answer>
+ [Final answer...]
+ </answer>
+ ```
+
+ **Note:** Despite these training instructions, the deployed model currently outputs only the final answer (e.g., "Paris" for the question "What is the capital of France?") without the XML tags or the reasoning section.
+
+ ---
+
+ ## Training Summary
+
+ The model was trained using a reinforcement learning framework based on Unsloth's GRPO (Group Relative Policy Optimization) with an integrated REINFORCE baseline. Key elements of the training include:
+
+ - **Base Model:** Training began with the pre-trained Qwen2.5-3B-Instruct model.
+ - **GRPO with REINFORCE Baseline:** Training uses Unsloth's GRPO algorithm combined with a REINFORCE baseline: the baseline subtracts the mean reward from each individual reward, reducing variance and stabilizing training (sketched below, after the summary paragraph).
+ - **Multiple Reward Functions:** A series of reward functions was defined to:
+   - Encourage correctness of the final answer.
+   - Reward adherence to the specified XML format (even though the deployed model now outputs only the final answer).
+   - Check additional aspects of the output, such as whether the extracted answer is a plain number.
+ - **LoRA Fine-Tuning:** Low-Rank Adaptation (LoRA) was applied to specific target modules (e.g., `q_proj`, `k_proj`, `v_proj`, `o_proj`) to fine-tune the model efficiently (a configuration sketch follows this list).
+ - **Custom Dataset:** A variant of the GSM8K dataset was used, with prompts modified to include instructions for structured responses.
+ - **Monitoring:** The training process was logged with Weights & Biases (wandb) to track reward metrics and model performance.
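+
+ For reference, here is a minimal sketch of what the LoRA setup described above can look like with the `peft` library; the rank, alpha, and dropout values are illustrative assumptions rather than the exact values used for this model:
+
+ ```python
+ from peft import LoraConfig, get_peft_model
+ from transformers import AutoModelForCausalLM
+
+ # Illustrative hyperparameters -- the exact rank/alpha used for Sir-Thinksalot are not documented here.
+ lora_config = LoraConfig(
+     r=16,
+     lora_alpha=32,
+     lora_dropout=0.0,
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
+     task_type="CAUSAL_LM",
+ )
+
+ base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
+ model = get_peft_model(base, lora_config)
+ model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
+ ```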
+
+ This combination of techniques, especially the use of a REINFORCE baseline within GRPO, has been instrumental in guiding the model toward producing more correct and reliable final answers.
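+
+ As a minimal illustration of how such reward functions and the mean-reward baseline fit together, consider the sketch below; the function names, answer extraction, and reward values are simplified assumptions, not the exact functions used in training:
+
+ ```python
+ import re
+
+ def extract_answer(text: str) -> str:
+     """Return the contents of <answer>...</answer>, falling back to the raw text."""
+     match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
+     return (match.group(1) if match else text).strip()
+
+ def correctness_reward(completion: str, gold: str) -> float:
+     # Encourage a final answer that matches the reference answer.
+     return 2.0 if extract_answer(completion) == gold.strip() else 0.0
+
+ def format_reward(completion: str) -> float:
+     # Reward adherence to the <reasoning>/<answer> structure.
+     pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
+     return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0
+
+ def advantages(rewards: list[float]) -> list[float]:
+     # REINFORCE-style baseline: subtract the group's mean reward so that
+     # above-average completions receive a positive learning signal.
+     baseline = sum(rewards) / len(rewards)
+     return [r - baseline for r in rewards]
+
+ # Example: total rewards for four sampled completions of one prompt
+ print(advantages([2.5, 0.5, 2.0, 0.0]))  # [1.25, -0.75, 0.75, -1.25]
+ ```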
+
+ ---
+
+ ## How to Use
+
+ Sir-Thinksalot is available on the Hugging Face Hub and can be integrated into your projects with ease.
+
+ ### Installation
+
+ Ensure you have the necessary libraries installed:
+
+ ```bash
+ pip install huggingface_hub transformers torch
+ ```
+
+ ### Loading the Model
+
+ You can load the model and its tokenizer using the Hugging Face Transformers library:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the tokenizer and model from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("arianUniverse/Sir-Thinksalot")
+ model = AutoModelForCausalLM.from_pretrained("arianUniverse/Sir-Thinksalot")
+
+ # Example prompt
+ prompt = "What is the capital of France?"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ output = model.generate(**inputs, max_new_tokens=50)
+
+ # Decode only the newly generated tokens so the prompt is not echoed back
+ response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+ print(response)
+ ```
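+
+ Because the base model is an instruction-tuned chat model and training used a system prompt, you may get more consistent behavior by formatting the input with the tokenizer's chat template. Continuing from the example above (the system prompt text here is a placeholder, not the exact prompt shipped with this repository):
+
+ ```python
+ # Placeholder system prompt -- substitute the one provided in this repository if desired.
+ messages = [
+     {"role": "system", "content": "Think carefully, then give the final answer."},
+     {"role": "user", "content": "What is the capital of France?"},
+ ]
+
+ # Tokenize with the model's chat template and generate
+ chat_inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ )
+ chat_output = model.generate(chat_inputs, max_new_tokens=50)
+ print(tokenizer.decode(chat_output[0][chat_inputs.shape[1]:], skip_special_tokens=True))
+ ```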
+
+ ### Expected Output
+
+ Given the current behavior of the model, if you ask:
+
+ ```
+ What is the capital of France?
+ ```
+
+ You can expect a direct answer such as:
+
+ ```
+ Paris
+ ```
+
+ without any structured `<reasoning>` or `<answer>` tags.
+
+ ---
+
+ ## Additional Information
+
+ - **Training Artifacts:** The model was trained using multiple reward functions, including those that encouraged XML structure, alongside a REINFORCE baseline to stabilize learning. However, the output is presently limited to the final answer.
+ - **System Prompt:** The repository contains a default system prompt used during training. While it provides insight into the intended formatting, the deployed model focuses solely on delivering the answer.
+ - **Post-Processing:** If you require structured output, consider a post-processing step that wraps the final answer in your desired XML format (a minimal sketch follows this list).
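+
+ A minimal sketch of such a post-processing step, assuming you only need to wrap the plain-text answer in the `<reasoning>`/`<answer>` layout shown earlier (the helper name and the empty reasoning block are illustrative):
+
+ ```python
+ def wrap_answer(answer: str, reasoning: str = "") -> str:
+     """Wrap a plain answer (and optional reasoning text) in the XML layout used during training."""
+     return (
+         "<reasoning>\n"
+         f"{reasoning}\n"
+         "</reasoning>\n"
+         "<answer>\n"
+         f"{answer}\n"
+         "</answer>"
+     )
+
+ print(wrap_answer("Paris"))
+ ```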
+
+ Feel free to explore Sir-Thinksalot further and adapt it to meet your specific application requirements!
+
  ---