# Model Card: InfiX-ai/InfiAlign-Qwen-7B-DPO
InfiAlign is a scalable and data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.
At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources.
When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks.
Applying Direct Preference Optimization (DPO) on top of the SFT model yields further gains, most notably in mathematical reasoning, with an average improvement of 3.89% on the AIME 24/25 benchmarks.
## InfiAlign Model Series
The InfiAlign framework offers multiple variants tailored for different alignment strategies:
- InfiAlign-Qwen-7B-SFT: Fine-tuned using curriculum-style instruction data.
- InfiAlign-Qwen-7B-DPO: Trained with Direct Preference Optimization (DPO) to improve reasoning alignment. [You are here!]
- InfiAlign-Qwen-7B-R1: Reinforcement learning variant (GRPO) for further refinement.
## Model Description
- Model Name: InfiAlign-Qwen-7B-DPO
- Developed by: InfiX-ai
- Fine-tuned from: InfiAlign-Qwen-7B-SFT
- Model Type: 7B-parameter decoder-only Transformer
- Context Length: 32K tokens
- License: Apache 2.0
- Status: Static checkpoint (offline training)
## Training Details
### Dataset Overview
A total of 10K curated samples across three core reasoning domains:

Domain | Curated Samples |
---|---|
Mathematics | 3.5K |
Code | 3.5K |
Science | 3K |
Each sample includes preference-ranked completions distilled from stronger teacher models, selected for difficulty and diversity.
Data Sources: OpenMathReasoning, Mixture-of-Thoughts, OpenScience
### Data Pipeline
- Data Decontamination and Deduplication: We decontaminate the data against the evaluation benchmarks and remove samples that duplicate the SFT training dataset.
- Data Selection: We first use Qwen2.5-32B-Instruct to annotate each sample with domain-specific labels. For each category, we select the problems with the longest solutions, treating these as the most challenging. Our SFT model then generates responses for these selected problems, which are used in the subsequent rejection sampling step.
- Rejection Sampling: We employ Qwen2.5-32B-Instruct to judge the SFT model's responses to math and science questions, and use an internal sandbox service to verify the correctness of code-related answers. For each domain, we select the incorrect responses with the longest solutions from each category, keeping the number of samples balanced across categories. We use the solutions generated by DeepSeek-R1 directly as the positive samples and pair them with the selected incorrect responses to construct training pairs (see the sketch after this list).
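
The pair-construction step above can be read as a small filtering script: keep the verifier-rejected SFT responses, prefer the longest ones per category up to a balanced quota, and pair each with the corresponding DeepSeek-R1 solution as the preferred completion. The sketch below only illustrates that logic; the record fields (`question`, `category`, `sft_response`, `is_correct`, `r1_solution`) and the quota parameter are hypothetical names, not the actual pipeline's schema.

```python
from collections import defaultdict

def build_dpo_pairs(records, per_category_quota):
    """Illustrative sketch of the rejection-sampling pair construction.

    `records` is assumed to be a list of dicts with hypothetical keys:
    question, category, sft_response, is_correct (verifier verdict), r1_solution.
    """
    # Keep only responses the verifier judged incorrect (the rejected side).
    wrong = [r for r in records if not r["is_correct"]]

    # Group by category so the final pair set stays balanced across categories.
    by_category = defaultdict(list)
    for r in wrong:
        by_category[r["category"]].append(r)

    pairs = []
    for category, items in by_category.items():
        # Prefer the longest incorrect responses, used here as a difficulty proxy.
        items.sort(key=lambda r: len(r["sft_response"]), reverse=True)
        for r in items[:per_category_quota]:
            pairs.append({
                "prompt": r["question"],
                "chosen": r["r1_solution"],     # DeepSeek-R1 solution as positive
                "rejected": r["sft_response"],  # incorrect SFT response as negative
            })
    return pairs
```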
### Training Procedure
Alignment Algorithm: Direct Preference Optimization (DPO)

Training Hyperparameters:

Hyperparameter | Value |
---|---|
Batch Size | 16 |
Learning Rate | 5e-7 |
LR Scheduler | cosine |
Warmup Ratio | 0.1 |
Epoch | 3 |
Sequence Parallelism | 4 |
Loss | sigmoid preference loss |
Preference Beta | 0.1 |
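
The "sigmoid preference loss" with beta = 0.1 in the table corresponds to the standard DPO objective. The following is a minimal, self-contained sketch of that loss for reference; it is not the training code used for this model.

```python
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a tensor of summed log-probabilities of the chosen or
    rejected completion under the policy or the frozen reference (SFT) model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    reference_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - reference_margin)
    return -F.logsigmoid(logits).mean()
```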
## Evaluation
We evaluate InfiAlign-Qwen-7B-DPO on a range of benchmarks to assess its reasoning, problem-solving, and code generation capabilities. All metrics are reported as Pass@1 under a consistent regex-based answer extraction pipeline, adapted from LIMO.
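
To make "regex-based answer extraction" concrete, the sketch below pulls the final `\boxed{...}` answer out of a generated solution before comparing it to the reference; the actual rules in the LIMO-adapted pipeline may differ in detail.

```python
import re

def extract_boxed_answer(text):
    """Illustrative only: return the contents of the last \\boxed{...} in `text`,
    allowing one level of nested braces; returns None if no boxed answer exists."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1].strip() if matches else None
```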
### Benchmark Overview
- AIME24 / AIME25: American Invitational Mathematics Examination problems (Olympiad-level high school math).
- MATH500: Subset of the MATH dataset focused on complex mathematical reasoning.
- GPQA Diamond: Graduate-level, "Google-proof" multiple-choice science questions covering biology, physics, and chemistry.
- MMLU-Pro: Professional-level subset of the Massive Multitask Language Understanding benchmark.
- LiveCodeBench: Code reasoning benchmark using real-world coding problems.
### Performance Comparison (Pass@1)
Model | Initial CKPT | Data Size | AIME 2025 (avg@64) | AIME 2024 (avg@64) | MATH500 (avg@4) | GPQA Diamond (avg@8) | MMLU-Pro (pass@1) | LiveCodeBench-v5 (avg@8) | Avg. |
---|---|---|---|---|---|---|---|---|---|
Qwen2.5-7B-Instruct | Qwen2.5-7B-Base | 1M | 8.80 | 11.93 | 76.15 | 38.70 | 57.49 | 15.77 | 34.80 |
Qwen2.5-Math-7B-Instruct | Qwen2.5-7B-Math-Base | 2.5M | 6.72 | 6.67 | 82.40 | 31.12 | 43.06 | 2.68 | 28.78 |
DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-7B-Math-Base | 800K | 37.97 | 55.50* | 92.80* | 49.10* | 54.16 | 37.60* | 54.43 |
OpenThinker2-7B | Qwen2.5-7B-Instruct | 1M | 38.70* | 60.70* | 87.60* | 47.00* | 40.60* | 37.50 | 52.01 |
Light-R1-7B-DS | DeepSeek-R1-Distill-Qwen-7B | 3K | 44.30* | 59.10* | 91.35 | 49.40* | 54.95 | 38.40 | 56.25 |
InfiAlign-Qwen-7B-SFT-92K (ours) | Qwen2.5-7B-Math-Base | 92K | 43.39 | 56.46 | 92.35 | 48.48 | 53.51 | 34.05 | 54.70 |
InfiAlign-Qwen-7B-DPO-9K (ours) | InfiAlign-Qwen-7B-SFT-92K | 9K | 44.06 | 61.04 | 91.95 | 48.17 | 49.90 | 34.54 | 54.94 |
InfiAlign-Qwen-7B-SFT-165K (ours) | Qwen2.5-7B-Math-Base | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 36.20 | 57.52 |
InfiAlign-Qwen-7B-DPO-10K (ours) | InfiAlign-Qwen-7B-SFT-165K | 10K | 47.45 | 61.25 | 93.45 | 51.77 | 53.95 | 35.30 | 57.20 |
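
The avg@k notation in the table is read here as the mean pass rate over k sampled completions per problem (so avg@1 reduces to pass@1). A minimal sketch of that averaging, assuming per-problem lists of per-sample correctness flags:

```python
def avg_at_k(correct_flags_per_problem):
    """correct_flags_per_problem: one list of k booleans per problem (True means
    the sampled completion was judged correct). Returns the mean pass rate in %."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return 100.0 * sum(per_problem) / len(per_problem)
```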
## Usage
Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and generate content.
- Note: Make sure the model's output starts with "<think>\n"; otherwise it may generate empty reasoning, which reduces output quality. If you use `apply_chat_template` with `add_generation_prompt=True`, this is handled automatically, but the leading "<think>" tag may then be missing from the beginning of the decoded response.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "InfiX-ai/InfiAlign-Qwen-7B-DPO"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the chat prompt (raw string so the LaTeX backslashes are preserved)
prompt = r"Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
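
Because the chat template already appends the "<think>\n" prefix to the prompt, the decoded `response` above may begin without the opening tag. A small follow-up to the snippet above (reusing its `response` variable) re-attaches the tag when you want the complete reasoning trace:

```python
# Re-attach the leading <think> tag if it was consumed by the prompt template.
if not response.lstrip().startswith("<think>"):
    response = "<think>\n" + response
print(response)
```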
## Intended Uses
### Appropriate Use Cases
- Reasoning tasks in math, science, and code
- Chat-based AI assistants requiring structured problem-solving
- Educational and research tools focused on logic-based domains
### Out-of-Scope Uses
- High-stakes applications (e.g., legal, medical)
- Non-English or multilingual scenarios (model is primarily trained on English)
- Tasks not related to reasoning or logic-intensive domains
## Bias, Risks, and Limitations
### Bias
- English-centric training may result in underperformance on non-English tasks
- Potential propagation of stereotypes or social biases from source data
### Risks
- May produce hallucinated or incorrect outputs
- Risk of unsafe or offensive completions in adversarial contexts
- Code outputs may be syntactically correct but functionally incorrect
### Limitations
- Lacks fine-grained safety alignment beyond DPO
- Performance outside of math/code/science domains remains unverified
## Citation
```bibtex
@misc{cai2025infialignscalablesampleefficientframework,
      title={InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities},
      author={Shuo Cai and Su Lu and Qi Zhou and Kejing Yang and Zhijie Sang and Congkai Xie and Hongxia Yang},
      year={2025},
      eprint={2508.05496},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.05496},
}
```
## News
- We released the model checkpoint for InfiAlign-Qwen-7B-DPO!
- We released InfiAlign-Qwen-7B-DPO-Eval-Response! This dataset contains the detailed evaluation responses generated by our DPO model across various benchmarks.