Sir-Thinksalot

Sir-Thinksalot is a fine-tuned language model built on top of the Qwen2.5-3B-Instruct architecture. It was optimized with reinforcement learning, specifically unsloth's GRPO with an integrated REINFORCE baseline, to encourage well-reasoned responses. Although training included instructions to follow a detailed XML format, in practice the model outputs only the final answer.


Model Description

Sir-Thinksalot was designed to ideally output responses in a structured XML format:

<reasoning>
[Detailed reasoning...]
</reasoning>
<answer>
[Final answer...]
</answer>

Note: Despite these training instructions, the deployed model currently outputs only the final answer (e.g., "Paris" for the question "What is the capital of France?") without any additional formatting or reasoning tags.


Training Summary

The model was trained with a reinforcement learning setup based on unsloth's GRPO, enhanced with a REINFORCE baseline that reduces the variance of the policy-gradient estimate. Key aspects include:

  • Base Model: Training started from the pre-trained Qwen2.5-3B-Instruct model.
  • GRPO with REINFORCE Baseline: Training used unsloth's GRPO algorithm with a REINFORCE baseline, which subtracts the group's average reward from each individual reward for improved stability (see the first sketch after this list).
  • Multiple Reward Functions: Several reward functions were combined to encourage:
    • Correctness of the final answer.
    • Adherence to the specified XML formatting (even though the current output is just the answer).
  • LoRA Fine-Tuning: Low-Rank Adaptation (LoRA) was applied to specific target modules (such as q_proj, k_proj, v_proj, and o_proj) for parameter-efficient fine-tuning (see the second sketch).
  • Custom Dataset: A modified version of the GSM8K dataset was used, with system instructions added to elicit structured outputs (see the third sketch).
  • Monitoring: Training runs were logged with Weights & Biases (wandb) to track performance and reward metrics.
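
The exact reward functions used in training are not published with the model, so the sketch below only illustrates the shape of the setup: a correctness reward compares the extracted answer against the reference, a format reward checks for the XML tags, and the REINFORCE baseline subtracts the group's mean reward to form a lower-variance advantage. All names and reward values here are illustrative assumptions, not the actual training code.

import re

def correctness_reward(completion: str, reference: str) -> float:
    # Reward 1.0 if the text inside <answer>...</answer> (or the raw
    # completion, when the tags are missing) matches the reference answer.
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    predicted = match.group(1).strip() if match else completion.strip()
    return 1.0 if predicted == reference.strip() else 0.0

def format_reward(completion: str) -> float:
    # Reward 0.5 if the completion follows the <reasoning>/<answer> layout.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

def advantages(rewards: list[float]) -> list[float]:
    # REINFORCE baseline: subtract the mean reward of the sampled group,
    # so only above-average completions receive a positive advantage.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Example: total rewards for four sampled completions.
print(advantages([1.5, 0.5, 0.0, 1.0]))  # [0.75, -0.25, -0.75, 0.25]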
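
For the LoRA setup, the second sketch uses the peft library's LoraConfig to show what a configuration targeting those modules looks like; the rank, alpha, and dropout values are assumed placeholders, not the values used to train Sir-Thinksalot.

from peft import LoraConfig

# Illustrative LoRA configuration targeting the attention projections
# listed above; r, lora_alpha, and lora_dropout are assumed values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)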
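
The modified GSM8K data itself is not included in this repository, but a preparation step of this general shape would pair each question with a system instruction requesting the structured output. The system prompt text and field names in this third sketch are assumptions based on the format described earlier, not the actual training script.

from datasets import load_dataset

# Hypothetical system prompt asking for the structured XML output.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>"
)

def to_chat(example):
    # GSM8K solutions end with "#### <number>"; keep only that final
    # number as the reference answer for the correctness reward.
    reference = example["answer"].split("####")[-1].strip()
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "reference": reference,
    }

dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_chat)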

How to Use

Sir-Thinksalot is available on the Hugging Face Hub and can be integrated into your projects. The instructions below use the llama-cpp-python bindings (imported as llama_cpp).

Installation

Install the llama-cpp-python package:

pip install llama-cpp-python

Loading the Model with llama_cpp

You can load the model using llama_cpp by specifying the GGUF file corresponding to the quantized model weights. For example:

from llama_cpp import Llama

# Download the quantized GGUF weights from the Hub and load them.
llm = Llama.from_pretrained(
    repo_id="arianUniverse/Sir-Thinksalot",
    filename="unsloth.Q4_K_M.gguf",
)

# Ask a question through the chat completion API.
result = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)

# Print only the generated message text.
print(result["choices"][0]["message"]["content"])
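
If you want to experiment with the training-style formatting instructions, you can also pass them as a system message. The prompt text below is an assumption modeled on the format described earlier, and as noted elsewhere in this card, the deployed model may still return only the bare answer.

result = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "Respond in the following format:\n"
                       "<reasoning>\n...\n</reasoning>\n"
                       "<answer>\n...\n</answer>",
        },
        {"role": "user", "content": "What is the capital of France?"},
    ]
)

print(result["choices"][0]["message"]["content"])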

Expected Output

Given the current behavior of the model, if you ask:

What is the capital of France?

The output will be a direct answer such as:

Paris

without any structured <reasoning> or <answer> tags.


Additional Information

  • Training Artifacts: The model was trained with multiple reward functions, including ones that encourage the XML structure, and uses a REINFORCE baseline within GRPO to stabilize training.
  • System Prompt: Although the training system prompt includes detailed formatting instructions, the deployed model currently generates only the final answer.
  • Post-Processing: If you need a structured response, consider adding a post-processing step that wraps the final answer in your desired XML format (see the sketch below).
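
A minimal sketch of such a post-processing step, assuming you only have the bare answer string (the function name is illustrative):

def wrap_answer(answer: str, reasoning: str = "") -> str:
    # Wrap a plain answer (and optional reasoning) in the XML layout
    # that the training prompt originally asked for.
    return (
        f"<reasoning>\n{reasoning}\n</reasoning>\n"
        f"<answer>\n{answer}\n</answer>"
    )

print(wrap_answer("Paris"))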

Feel free to explore Sir-Thinksalot further and adapt it to meet your application's requirements!

