DisCO-1.5B-Lratio
This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on the agentica-org/DeepScaleR-Preview-Dataset.
It was fine-tuned as part of the paper DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization (arXiv:2505.12366). Specifically, this model was fine-tuned with the DisCO framework using the likelihood ratio (L-ratio) score function.
The code is available at: https://github.com/Optimization-AI/DisCO
Below are comparisons with baseline models and baseline methods for fine-tuning 1.5B models. OpenAI-o1-Preview is included as a reference. MRL denotes the maximum response length used in training/testing. The bottom nine rows all fine-tune the DeepSeek-R1-Distill-Qwen-1.5B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1, and DSR is short for DeepScaleR.
Model | MRL(Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
---|---|---|---|---|---|---|---|---|
OpenAI-o1-Preview | - | 0.4 | - | 0.814 | - | - | - | - |
DS-Distill-Qwen-1.5B | 32k+ / 32k | 0.288 | 0.263 | 0.828 | 0.629 | 0.265 | 0.433 | 0.451 |
DS-Distill-Qwen-1.5B | 32k+ / 8k | 0.181 | 0.215 | 0.758 | 0.515 | 0.237 | 0.353 | 0.376 |
STILL-3-1.5B-preview | 29k / 32k | 0.325 | 0.248 | 0.844 | 0.667 | 0.290 | 0.454 | 0.471 |
DSR-1.5B-Preview | 24k / 32k | 0.431 | 0.304 | 0.878 | 0.736 | 0.302 | 0.500 | 0.525 |
DSR-1.5B-Preview | 24k / 8k | 0.358 | 0.258 | 0.860 | 0.679 | 0.297 | 0.473 | 0.488 |
GRPO | 8k / 8k | 0.277 | 0.242 | 0.838 | 0.647 | 0.276 | 0.462 | 0.457 |
GRPO+ER | 8k / 8k | 0.298 | 0.242 | 0.839 | 0.649 | 0.279 | 0.452 | 0.460 |
Dr. GRPO | 8k / 8k | 0.250 | 0.238 | 0.830 | 0.629 | 0.270 | 0.443 | 0.443 |
DAPO | 8k / 8k | 0.310 | 0.252 | 0.848 | 0.675 | 0.296 | 0.456 | 0.473 |
TRPA | 8k / 8k | 0.354 | 0.235 | 0.835 | 0.653 | 0.283 | 0.458 | 0.470 |
DisCO (L-ratio) | 8k / 8k | 0.381 | 0.306 | 0.878 | 0.746 | 0.319 | 0.512 | 0.524 |
DisCO (log-L) | 8k / 8k | 0.404 | 0.317 | 0.876 | 0.758 | 0.333 | 0.509 | 0.533 |
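
The Avg. column appears to be the unweighted mean of the six benchmark scores in each row. That reading is an assumption, not something the table states. The sketch below checks it for three rows copied from the table; the helper name `mean_score` is purely illustrative.

```python
# Scores copied from the table above, in column order:
# AIME 2024, AIME 2025, MATH 500, AMC 2023, Minerva, O-Bench.
# Each entry pairs the six scores with the reported Avg. value.
rows = {
    "GRPO": ([0.277, 0.242, 0.838, 0.647, 0.276, 0.462], 0.457),
    "DisCO (L-ratio)": ([0.381, 0.306, 0.878, 0.746, 0.319, 0.512], 0.524),
    "DisCO (log-L)": ([0.404, 0.317, 0.876, 0.758, 0.333, 0.509], 0.533),
}


def mean_score(scores):
    """Unweighted mean of a list of benchmark scores (assumed definition of Avg.)."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    computed = mean_score(scores)
    print(f"{name}: computed {computed:.4f}, reported {reported:.3f}")
```

For these three rows the computed mean agrees with the reported Avg. to within rounding at three decimal places.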
Citation
@article{li2025disco,
title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
journal={arXiv preprint arXiv:2505.12366},
year={2025}
}