Sir-Thinksalot
Sir-Thinksalot is a fine-tuned language model built on top of the Qwen2.5-3B-Instruct architecture. It was optimized using reinforcement learning techniques—specifically unsloth's GRPO with an integrated REINFORCE baseline—to encourage well-thought-out responses. Although the training process incorporated instructions to follow a detailed XML format, in practice the model outputs only the final answer.
Model Description
Sir-Thinksalot was designed to output responses in the following structured XML format:

```
<reasoning>
[Detailed reasoning...]
</reasoning>
<answer>
[Final answer...]
</answer>
```
Note: Despite these training instructions, the deployed model currently outputs only the final answer (e.g., "Paris" for the question "What is the capital of France?") without any additional formatting or reasoning tags.
Training Summary
The model was trained using a reinforcement learning framework based on unsloth's GRPO, enhanced with a REINFORCE baseline to reduce variance in the reward signal. Key aspects include:
- Base Model: The process started with the pre-trained Qwen2.5-3B-Instruct model.
- GRPO with REINFORCE Baseline: Training uses unsloth's GRPO algorithm with a REINFORCE baseline, which subtracts the group-average reward from each individual reward for improved stability (see the sketch after this list).
- Multiple Reward Functions: A variety of reward functions were defined to encourage:
  - The correctness of the final answer.
  - Adherence to the specified XML formatting (even though the current output is just the answer).
- LoRA Fine-Tuning: Low-Rank Adaptation (LoRA) was applied to specific target modules (such as `q_proj`, `k_proj`, `v_proj`, `o_proj`, etc.) for efficient fine-tuning; a configuration sketch appears after this list.
- Custom Dataset: A modified version of the GSM8K dataset was used, incorporating system instructions to produce structured outputs.
- Monitoring: Training was logged using Weights & Biases (wandb) to track performance and reward metrics.
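The exact reward functions used during training are not reproduced in this card, but the following minimal sketch illustrates the idea: a correctness reward, a format-adherence reward, and the REINFORCE-style baseline subtraction applied within a group of sampled completions. All function names and reward values below are illustrative assumptions, not the actual training code.

```python
import re

# Illustrative correctness reward: compare the extracted answer to the gold answer.
def correctness_reward(completion: str, gold_answer: str) -> float:
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    predicted = match.group(1).strip() if match else completion.strip()
    return 2.0 if predicted == gold_answer else 0.0

# Illustrative format reward: small bonus for following the <reasoning>/<answer> layout.
def format_reward(completion: str) -> float:
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

# REINFORCE baseline: subtract the group-mean reward from each sample's reward.
def advantages_with_baseline(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example: total rewards for four sampled completions of the same prompt.
group_rewards = [2.5, 0.0, 2.0, 0.5]
print(advantages_with_baseline(group_rewards))  # [1.25, -1.25, 0.75, -0.75]
```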
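For reference, attaching LoRA adapters to the named target modules with unsloth might look roughly like the sketch below. The rank, alpha, sequence length, and full module list are assumptions for illustration; they are not the hyperparameters actually used for Sir-Thinksalot.

```python
from unsloth import FastLanguageModel

# Load the base model (hypothetical settings, shown for illustration only).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projection modules.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```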
How to Use
Sir-Thinksalot is available on the Hugging Face Hub and can be integrated into your projects. Below are usage instructions using the `llama_cpp` library.
Installation
Install the `llama-cpp-python` package:

```bash
pip install llama-cpp-python
```
Loading the Model with llama_cpp
You can load the model using `llama_cpp` by specifying the GGUF file that corresponds to the quantized model weights. For example:
```python
from llama_cpp import Llama

# Download and load the quantized GGUF weights from the Hugging Face Hub.
llm = Llama.from_pretrained(
    repo_id="arianUniverse/Sir-Thinksalot",
    filename="unsloth.Q4_K_M.gguf",
)

# Ask a question through the chat-completion interface.
result = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)

print(result["choices"][0]["message"]["content"])
```
Expected Output
Given the current behavior of the model, if you ask:
What is the capital of France?
The output will be a direct answer such as:
Paris
without any structured `<reasoning>` or `<answer>` tags.
Additional Information
- Training Artifacts: The model was trained with multiple reward functions—including those encouraging XML structure—and uses a REINFORCE baseline within GRPO to stabilize training.
- System Prompt: Although the training system prompt includes detailed formatting instructions, the deployed model focuses solely on generating the final answer.
- Post-Processing: If you need a structured response, consider implementing a post-processing step that wraps the final answer in your desired XML format (see the sketch below).
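As a minimal sketch of such post-processing (the helper name is hypothetical), the plain answer returned by the model can be wrapped back into the XML layout described above:

```python
def wrap_in_xml(answer: str, reasoning: str = "") -> str:
    # Wrap a plain answer (and optional reasoning) in the <reasoning>/<answer> layout.
    return f"<reasoning>\n{reasoning}\n</reasoning>\n<answer>\n{answer}\n</answer>"

print(wrap_in_xml("Paris"))
```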
Feel free to explore Sir-Thinksalot further and adapt it to meet your application's requirements!