# DisCO-7B-Lratio

This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on the agentica-org/DeepScaleR-Preview-Dataset.

It was fine-tuned as part of the paper [DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization](https://arxiv.org/abs/2505.12366). Specifically, this model was fine-tuned with the DisCO framework using the likelihood ratio (L-ratio) scoring function.
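For orientation, below is a minimal sketch of a token-averaged likelihood-ratio score, assuming per-token log-probabilities under the current and old policies are available. The exact scoring function and constraints DisCO uses are defined in the paper, so treat this as illustrative rather than the reference implementation.

```python
import torch

def l_ratio_score(logp_theta: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Illustrative likelihood-ratio score for one sampled response.

    logp_theta, logp_old: per-token log-probabilities of the response under
    the current and old policies, shape (seq_len,). The score averages the
    token-level ratio pi_theta / pi_old over the response; see the DisCO
    paper (arXiv:2505.12366) for the exact definition.
    """
    return torch.exp(logp_theta - logp_old).mean()
```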

The code is available at: https://github.com/Optimization-AI/DisCO
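A minimal inference sketch with 🤗 Transformers follows. The hub repository ID below is a guess from the model name and may differ, and the sampling settings are those commonly recommended for DeepSeek-R1-Distill models, not values confirmed by this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Optimization-AI/DisCO-7B-Lratio"  # hypothetical repo ID; replace with the actual hub path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x? Think step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model was trained with an 8k max response length, so allow a generous budget.
output_ids = model.generate(
    input_ids, max_new_tokens=8192, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```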

Below are comparisons with baseline models and baseline methods for fine-tuning 7B models. MRL denotes the Max Response Length used in training/testing. The bottom seven methods all fine-tune the DeepSeek-R1-Distill-Qwen-7B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1.

| Model | MRL (Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| DS-Distill-Qwen-7B | 32k+ / 32k | 0.560 | 0.396 | 0.923 | 0.825 | 0.380 | 0.568 | 0.609 |
| DS-Distill-Qwen-7B | 32k+ / 8k | 0.402 | 0.292 | 0.873 | 0.688 | 0.355 | 0.471 | 0.513 |
| GRPO-LEAD-7B | 8k / 8k | 0.470 | 0.345 | 0.893 | 0.748 | 0.372 | 0.500 | 0.555 |
| TRPA | 8k / 8k | 0.570 | - | 0.870 | 0.780 | 0.360 | 0.550 | - |
| GRPO | 8k / 8k | 0.498 | 0.394 | 0.916 | 0.807 | 0.381 | 0.555 | 0.592 |
| GRPO+ER | 8k / 8k | 0.515 | 0.381 | 0.916 | 0.825 | 0.376 | 0.544 | 0.593 |
| Dr. GRPO | 8k / 8k | 0.488 | 0.346 | 0.910 | 0.792 | 0.368 | 0.546 | 0.575 |
| DAPO | 8k / 8k | 0.454 | 0.335 | 0.907 | 0.799 | 0.388 | 0.535 | 0.570 |
| TRPA | 8k / 8k | 0.510 | 0.367 | 0.898 | 0.779 | 0.379 | 0.534 | 0.578 |
| DisCO (L-ratio) | 8k / 8k | 0.583 | 0.421 | 0.923 | 0.852 | 0.399 | 0.585 | 0.627 |
| DisCO (log-L) | 8k / 8k | 0.558 | 0.410 | 0.927 | 0.854 | 0.410 | 0.592 | 0.625 |
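The Avg. column is consistent with an unweighted mean of the six benchmark scores; a quick check against the DisCO (L-ratio) row:

```python
# Sanity check: Avg. as the unweighted mean of the six benchmark scores.
scores = [0.583, 0.421, 0.923, 0.852, 0.399, 0.585]  # DisCO (L-ratio) row
print(round(sum(scores) / len(scores), 3))  # 0.627
```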

## Citation

```bibtex
@article{li2025disco,
  title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
  author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2505.12366},
  year={2025}
}
```