# DisCO-7B-Lratio

This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B on the agentica-org/DeepScaleR-Preview-Dataset.

It was fine-tuned as part of the paper [DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization](https://arxiv.org/abs/2505.12366). Specifically, this model was fine-tuned with the DisCO framework using the likelihood ratio (L-ratio) scoring function.
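For orientation, below is a minimal sketch of a token-averaged likelihood-ratio score, assuming per-token log-probabilities under the current and old policies are available. The exact scoring function and constraints DisCO uses are defined in the paper, so treat this as illustrative rather than the reference implementation.

```python
import torch

def l_ratio_score(logp_theta: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Illustrative likelihood-ratio score for one sampled response.

    logp_theta, logp_old: per-token log-probabilities of the response under
    the current and old policies, shape (seq_len,). The score averages the
    token-level ratio pi_theta / pi_old over the response; see the DisCO
    paper (arXiv:2505.12366) for the exact definition.
    """
    return torch.exp(logp_theta - logp_old).mean()
```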

The code is available at: https://github.com/Optimization-AI/DisCO
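A minimal inference sketch with 🤗 Transformers follows. The hub repository ID below is a guess from the model name and may differ, and the sampling settings are those commonly recommended for DeepSeek-R1-Distill models, not values confirmed by this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Optimization-AI/DisCO-7B-Lratio"  # hypothetical repo ID; replace with the actual hub path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Solve: if 3x + 5 = 20, what is x? Think step by step."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model was trained with an 8k max response length, so allow a generous budget.
output_ids = model.generate(
    input_ids, max_new_tokens=8192, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```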

Below are comparisons with baseline models and baseline methods for fine-tuning 7B models. MRL denotes the Max Response Length used in training/testing. The bottom seven methods all fine-tune the DeepSeek-R1-Distill-Qwen-7B model on the same DeepScaleR dataset. DS is short for DeepSeek-R1.

| Model | MRL (Train/Test) | AIME 2024 | AIME 2025 | MATH 500 | AMC 2023 | Minerva | O-Bench | Avg. |
|---|---|---|---|---|---|---|---|---|
| DS-Distill-Qwen-7B | 32k+ / 32k | 0.560 | 0.396 | 0.923 | 0.825 | 0.380 | 0.568 | 0.609 |
| DS-Distill-Qwen-7B | 32k+ / 8k | 0.402 | 0.292 | 0.873 | 0.688 | 0.355 | 0.471 | 0.513 |
| GRPO-LEAD-7B | 8k / 8k | 0.470 | 0.345 | 0.893 | 0.748 | 0.372 | 0.500 | 0.555 |
| TRPA | 8k / 8k | 0.570 | - | 0.870 | 0.780 | 0.360 | 0.550 | - |
| GRPO | 8k / 8k | 0.498 | 0.394 | 0.916 | 0.807 | 0.381 | 0.555 | 0.592 |
| GRPO+ER | 8k / 8k | 0.515 | 0.381 | 0.916 | 0.825 | 0.376 | 0.544 | 0.593 |
| Dr. GRPO | 8k / 8k | 0.488 | 0.346 | 0.910 | 0.792 | 0.368 | 0.546 | 0.575 |
| DAPO | 8k / 8k | 0.454 | 0.335 | 0.907 | 0.799 | 0.388 | 0.535 | 0.570 |
| TRPA | 8k / 8k | 0.510 | 0.367 | 0.898 | 0.779 | 0.379 | 0.534 | 0.578 |
| DisCO (L-ratio) | 8k / 8k | 0.583 | 0.421 | 0.923 | 0.852 | 0.399 | 0.585 | 0.627 |
| DisCO (log-L) | 8k / 8k | 0.558 | 0.410 | 0.927 | 0.854 | 0.410 | 0.592 | 0.625 |
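The Avg. column is consistent with an unweighted mean of the six benchmark scores; a quick check against the DisCO (L-ratio) row:

```python
# Sanity check: Avg. as the unweighted mean of the six benchmark scores.
scores = [0.583, 0.421, 0.923, 0.852, 0.399, 0.585]  # DisCO (L-ratio) row
print(round(sum(scores) / len(scores), 3))  # 0.627
```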

## Citation

```bibtex
@article{li2025disco,
  title={DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization},
  author={Li, Gang and Lin, Ming and Galanti, Tomer and Tu, Zhengzhong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2505.12366},
  year={2025}
}
```