SmolLM2 1.7b Aligned and Reinforced Through Tulu 3!

SmolTulu Banner

SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of SmolTulu-1.7b-Instruct, which leverages AllenAI's Tulu 3 post-training pipeline

This model scores the highest current score in both IFEval and GSM8k while maintaining the extremely low contamination levels in Tulu 3 and SmolLM2! I've listed the datasets used to do both the RLVR stage, which is the same one mentioned used in the Tulu 3 paper.

Evaluation

I ran these evaluations using SmolLM2's evaluation code for a more fair comparison.

Metric SmolTulu-1.7b-Instruct SmolTulu-1.7b-Reinforced SmolLM2-1.7B-Instruct Llama-1B-Instruct Qwen2.5-1.5B-Instruct SmolLM1-1.7B-Instruct
ARC (Average) 51.5 51.1 51.7 41.6 46.2 43.7
BBH (3-shot) 33.8 33.4 32.2 27.6 35.3 25.7
GSM8K (5-shot) 51.6 61.0 48.2 26.8 42.8 4.6
HellaSwag 61.1 60.4 66.1 56.1 60.9 55.5
IFEval (Average prompt/inst) 67.7 69.3 56.7 53.5 47.4 23.1
MMLU-Pro (MCF) 17.4 17.3 19.3 12.7 24.2 11.7
PIQA 72.2 72.1 74.4 72.3 73.2 71.6

Training Details

The reinforced model used PPO with verifiable rewards:

  • Base model: SmolTulu-1.7b-Instruct
  • Learning rate: 3e-6
  • Total training episodes: 10M
  • PPO KL penalty coefficient (beta): 0.05
  • Maximum sequence/prompt length: 2048 tokens
  • Response length: 2048 tokens
  • Rollout batch size: 32
  • Minibatch size: 32
  • Temperature: 1.0
  • Penalty reward: -10.0 for incomplete generations
  • DeepSpeed Stage 3 optimization
  • Gradient checkpointing enabled
  • Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
  • Reward model multiplier: 0.0 (pure verifiable rewards)

Usage

Just like any Huggingface model, just run it using the transformers library:

# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "SultanR/SmolTulu-1.7b-Reinforced"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

Citation

@misc{alrashed2024smoltuluhigherlearningrate,
      title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs}, 
      author={Sultan Alrashed},
      year={2024},
      eprint={2412.08347},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.08347}, 
}

The training methodology follows the Tulu 3 paper:

@article{lambert2024tulu3,
  title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training},
  author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others},
  year={2024},
  journal={arXiv preprint arXiv:2411.15124}
}
Downloads last month
40
Safetensors
Model size
1.71B params
Tensor type
BF16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for SultanR/SmolTulu-1.7b-Reinforced

Finetuned
(2)
this model
Quantizations
3 models

Dataset used to train SultanR/SmolTulu-1.7b-Reinforced

Collection including SultanR/SmolTulu-1.7b-Reinforced