OpenRS-Star

OpenRS-Star extends the OpenRS project and shows that reinforcement learning can further improve reasoning in small LLMs under tight compute constraints.

This model fine-tunes Qwen3-1.7B with a two-stage completion-length curriculum and DAPO-style optimizations on a 7,000-sample mathematical reasoning dataset.
Training used 2× A100 and 2× H200 GPUs, at a total cost of under $100.
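
The card does not include a usage snippet; below is a minimal sketch of loading the model with the Hugging Face transformers library, assuming standard Qwen3-style chat formatting (the prompt and generation settings are illustrative only).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oanaflores/OpenRS-Star"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Illustrative math prompt; long reasoning traces need a generous token budget.
messages = [{"role": "user",
             "content": "Find the sum of the first 10 positive odd integers."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```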


Key Contributions

Improved Performance

  • AIME24: 50.0% (+13.3% over base model)
  • AMC23: 82.5% (+5% over base model)
  • Consistent or slightly improved results on MATH-500, OlympiadBench, and Minerva.

Multi-Stage Fine-Tuning

  • Stage 1: 4k-token completions (50 PPO steps)
  • Stage 2: 8k-token completions (38 PPO steps)
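
The actual training scripts live in the GitHub repository; the sketch below only illustrates how the two completion-length stages line up with the reported step counts. The helper name and config layout are hypothetical.

```python
# Hypothetical sketch of the two-stage completion-length schedule described above.
# Step counts and token limits come from the model card; everything else is illustrative.
STAGES = [
    {"name": "stage_1", "max_completion_tokens": 4 * 1024, "rl_steps": 50},
    {"name": "stage_2", "max_completion_tokens": 8 * 1024, "rl_steps": 38},
]

def completion_limit_for_step(global_step: int, stages=STAGES) -> int:
    """Return the completion-length cap in effect at a given global RL step."""
    seen = 0
    for stage in stages:
        seen += stage["rl_steps"]
        if global_step < seen:
            return stage["max_completion_tokens"]
    return stages[-1]["max_completion_tokens"]

assert completion_limit_for_step(0) == 4096    # first stage-1 step
assert completion_limit_for_step(49) == 4096   # last stage-1 step
assert completion_limit_for_step(50) == 8192   # first stage-2 step
assert completion_limit_for_step(87) == 8192   # last step (50 + 38 = 88 total)
```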

Optimizations

Applied GRPO with DAPO-style tricks for training stability and learning-signal quality (a minimal sketch of these pieces follows the list):

  • Clip-Higher
  • Pure Accuracy Reward
  • Reward masking for truncated answers
  • Token-average loss
  • Dynamic sampling filter
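
A hedged sketch of how these pieces commonly fit together is shown below. The exact implementation is in the repository; the function names here are illustrative, the masking convention (returning None for truncated completions) is an assumption, and the clip values follow DAPO's reported defaults rather than this project's verified settings.

```python
import torch

def completion_reward(is_correct: bool, truncated: bool) -> float | None:
    """Pure accuracy reward with reward masking: 1.0 for a correct final answer,
    0.0 for an incorrect one, and None for completions cut off by the length cap
    so they can be excluded from the update rather than punished as wrong."""
    if truncated:
        return None
    return 1.0 if is_correct else 0.0

def keep_prompt_group(rewards: list[float | None]) -> bool:
    """Dynamic sampling filter: drop prompt groups whose (non-masked) rewards are
    all identical, since group-relative advantages would be zero and add no signal."""
    scored = [r for r in rewards if r is not None]
    return len(set(scored)) > 1

def grpo_token_loss(ratio: torch.Tensor, advantage: torch.Tensor,
                    mask: torch.Tensor, eps_low: float = 0.2,
                    eps_high: float = 0.28) -> torch.Tensor:
    """Clip-Higher + token-average loss.

    ratio:     per-token probability ratio pi_theta / pi_old, shape [B, T]
    advantage: per-sequence advantage broadcast over tokens, shape [B, T]
    mask:      1 for valid completion tokens, 0 for padding, shape [B, T]

    eps_high > eps_low widens the upper clip so low-probability tokens with
    positive advantage can still be reinforced; the loss is averaged over all
    valid tokens in the batch rather than per sequence, so long completions
    are not down-weighted.
    """
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    per_token = -torch.minimum(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```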

Efficient Training

  • Total compute cost: under $100
  • Training completed in fewer than 100 PPO steps total

For full details, see the GitHub repository.

Math Benchmarks Results

(Benchmark results figure; see the GitHub repository.)

AIME24 vs Training Costs

(AIME24 accuracy vs. training cost figure; see the GitHub repository.)
