---
datasets:
- openai/gsm8k
base_model:
- Qwen/Qwen2.5-3B-Instruct
pipeline_tag: text-generation
license: other
---
# Sir-Thinksalot
**Sir-Thinksalot** is a fine-tuned language model built on top of Qwen2.5-3B-Instruct. It was optimized with reinforcement learning, specifically Unsloth's GRPO with an integrated REINFORCE baseline, to encourage well-thought-out responses. Although the training process incorporated instructions to follow a detailed XML format, in practice the model outputs only the final answer.
---
## Model Description
Sir-Thinksalot was trained to ideally output responses in a structured XML format:
```xml
<reasoning>
[Detailed reasoning...]
</reasoning>
<answer>
[Final answer...]
</answer>
```
**Note:** Despite these training instructions, the deployed model currently outputs only the final answer (e.g., "Paris" for the question "What is the capital of France?") without any additional formatting or reasoning tags.
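If a deployment does return the structured format, or you want code that handles both cases, a small parser along these lines can extract the sections. This is an illustrative sketch; the helper name `parse_structured_output` is not part of the model's tooling:

```python
import re

def parse_structured_output(text: str) -> dict:
    """Extract the <reasoning> and <answer> sections from a model response.

    Falls back to treating the whole response as the answer when the tags
    are absent, which matches the model's current plain-text behavior.
    """
    reasoning = re.search(r"<reasoning>\s*(.*?)\s*</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return {
        "reasoning": reasoning.group(1) if reasoning else None,
        "answer": answer.group(1) if answer else text.strip(),
    }

# Structured response (the intended format)
structured = "<reasoning>\nParis is the capital.\n</reasoning>\n<answer>\nParis\n</answer>"
print(parse_structured_output(structured)["answer"])  # Paris

# Plain response (the current behavior)
print(parse_structured_output("Paris")["answer"])     # Paris
```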
---
## Training Summary
The model was trained using a reinforcement learning framework based on Unsloth's GRPO, enhanced with a REINFORCE baseline to reduce variance in the reward signal. Key aspects include:
- **Base Model:** The process started with the pre-trained Qwen2.5-3B-Instruct model.
- **GRPO with REINFORCE Baseline:** The training utilizes Unsloth's GRPO algorithm with a REINFORCE baseline, which subtracts the average reward of a sampled group from each individual reward for improved stability.
- **Multiple Reward Functions:** A variety of reward functions were defined to encourage:
- The correctness of the final answer.
- Adherence to the specified XML formatting (even though the current output is just the answer).
- **LoRA Fine-Tuning:** Low-Rank Adaptation (LoRA) was applied to specific target modules (such as `q_proj`, `k_proj`, `v_proj`, `o_proj`, etc.) for efficient fine-tuning.
- **Custom Dataset:** A modified version of the GSM8K dataset was used, incorporating system instructions to produce structured outputs.
- **Monitoring:** Training was logged using Weights & Biases (wandb) to track performance and reward metrics.
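As an illustration only (the exact reward shapes and weights used in training are not published here), the combination of correctness and format rewards with a group-mean baseline can be sketched like this:

```python
import re

def correctness_reward(completion: str, target: str) -> float:
    """Hypothetical reward: full credit when the final answer matches the target."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    answer = match.group(1) if match else completion.strip()
    return 2.0 if answer == target else 0.0

def format_reward(completion: str) -> float:
    """Hypothetical reward: small bonus for following the XML template."""
    has_reasoning = "<reasoning>" in completion and "</reasoning>" in completion
    has_answer = "<answer>" in completion and "</answer>" in completion
    return 0.5 if has_reasoning and has_answer else 0.0

def advantages(rewards: list[float]) -> list[float]:
    """REINFORCE baseline: subtract the group's mean reward from each reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four sampled completions for one GSM8K-style prompt whose target answer is "7"
group = [
    "<reasoning>3 + 4 = 7</reasoning><answer>7</answer>",
    "7",
    "8",
    "<answer>7</answer>",
]
rewards = [correctness_reward(c, "7") + format_reward(c) for c in group]
print(rewards)              # [2.5, 2.0, 0.0, 2.0]
print(advantages(rewards))  # mean-centered, sums to zero
```

Subtracting the group mean leaves the relative ranking of completions unchanged while centering the policy-gradient signal around zero, which is what reduces its variance.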
---
## How to Use
Sir-Thinksalot is available on the Hugging Face Hub and can be integrated into your projects. Below are usage instructions using the `llama_cpp` library.
### Installation
Install the `llama-cpp-python` package:
```bash
pip install llama-cpp-python
```
### Loading the Model with llama_cpp
You can load the model using `llama_cpp` by specifying the GGUF file corresponding to the quantized model weights. For example:
```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="arianUniverse/Sir-Thinksalot",
    filename="unsloth.Q4_K_M.gguf",
)

result = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

print(result["choices"][0]["message"]["content"])
```
### Expected Output
Given the current behavior of the model, if you ask:
```
What is the capital of France?
```
The output will be a direct answer such as:
```
Paris
```
without any structured `<reasoning>` or `<answer>` tags.
---
## Additional Information
- **Training Artifacts:** The model was trained with multiple reward functions (including ones encouraging XML structure) and uses a REINFORCE baseline within GRPO to stabilize training.
- **System Prompt:** Although the training system prompt includes detailed formatting instructions, the deployed model focuses solely on generating the final answer.
- **Post-Processing:** If you need a structured response, consider implementing a post-processing step to wrap the final answer in your desired XML format.
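The post-processing step mentioned above can be as simple as the following sketch (the `wrap_answer` helper is hypothetical, not shipped with the model):

```python
def wrap_answer(answer: str, reasoning: str = "") -> str:
    """Wrap a plain answer in the XML template the model was trained toward."""
    return (
        f"<reasoning>\n{reasoning}\n</reasoning>\n"
        f"<answer>\n{answer}\n</answer>"
    )

print(wrap_answer("Paris"))
```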
Feel free to explore Sir-Thinksalot further and adapt it to meet your application's requirements!
---