DisCO-1.5B-Lratio

This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on the agentica-org/DeepScaleR-Preview-Dataset.

It was fine-tuned as part of the paper DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization (paper link). Specifically, this model was fine-tuned by DisCO framework with Likelihood ratio (L-ratio) score function.

The code is available at: https://github.com/Optimization-AI/DisCO

Below are comparisons with baseline models and baseline methods for fine-tuning 1.5B models. OpenAI-o1-preview is included as a reference. MRL denotes Max Response Length utilized in training/testing. The bottom 9 methods are all for fine-tuning DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1, DSR is short for DeepScalaR.

Model MRL(Train/Test) AIME 2024 AIME 2025 MATH 500 AMC 2023 Minerva O-Bench Avg.
OpenAI-o1-Preview - 0.4 - 0.814 - - - -
DS-Distill-Qwen-1.5B 32k+ / 32k 0.288 0.263 0.828 0.629 0.265 0.433 0.451
DS-Distill-Qwen-1.5B 32k+ / 8k 0.181 0.215 0.758 0.515 0.237 0.353 0.376
STILL-3-1.5B-preview 29k / 32k 0.325 0.248 0.844 0.667 0.290 0.454 0.471
DSR-1.5B-Preview 24k / 32k 0.431 0.304 0.878 0.736 0.302 0.500 0.525
DSR-1.5B-Preview 24k / 8k 0.358 0.258 0.860 0.679 0.297 0.473 0.488
GRPO 8k / 8k 0.277 0.242 0.838 0.647 0.276 0.462 0.457
GRPO+ER 8k / 8k 0.298 0.242 0.839 0.649 0.279 0.452 0.460
Dr. GRPO 8k / 8k 0.250 0.238 0.830 0.629 0.270 0.443 0.443
DAPO 8k / 8k 0.310 0.252 0.848 0.675 0.296 0.456 0.473
TRPA 8k / 8k 0.354 0.235 0.835 0.653 0.283 0.458 0.470
DisCO (L-ratio) 8k / 8k 0.381 0.306 0.878 0.746 0.319 0.512 0.524
DisCO (log-L) 8k / 8k 0.404 0.317 0.876 0.758 0.333 0.509 0.533

Citation

@article{li2025disco,
  title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
  author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2505.12366},
  year={2025}
}
Downloads last month
13
Safetensors
Model size
1.78B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ganglii/DisCO-1.5B-Lratio

Quantizations
1 model