# Model Card: InfiX-ai/InfiAlign-Qwen-7B-DPO
InfiAlign is a scalable and data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.
At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources.
When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks.
Applying Direct Preference Optimization (DPO) on top of the SFT model yields further gains, most notably in mathematical reasoning, with an average improvement of 3.89% on the AIME 24/25 benchmarks.
## InfiAlign Model Series
The InfiAlign framework offers multiple variants tailored for different alignment strategies:
- InfiAlign-Qwen-7B-SFT: Fine-tuned using curriculum-style instruction data.
- InfiAlign-Qwen-7B-DPO: Trained with Direct Preference Optimization (DPO) to improve reasoning alignment. [You are here!]
- InfiAlign-Qwen-7B-R1: Reinforcement learning variant (GRPO) for further refinement.
## Model Description
- Model Name: InfiAlign-Qwen-7B-DPO
- Developed by: InfiX-ai
- Fine-tuned from: InfiAlign-Qwen-7B-SFT
- Model Type: 7B-parameter decoder-only Transformer
- Context Length: 32K tokens
- License: Apache 2.0
- Status: Static checkpoint (offline training)
## Training Details
### Dataset Overview
A total of 10K curated samples across three core reasoning domains:

Domain | Curated Samples |
---|---|
Mathematics | 3.5K |
Code | 3.5K |
Science | 3K |
Each sample includes preference-ranked completions distilled from stronger teacher models, selected for difficulty and diversity.
Data Sources: OpenMathReasoning, Mixture-of-Thoughts, OpenScience
### Data Pipeline
- Data Decontamination and Deduplication: We decontaminate the data against the evaluation benchmarks and remove samples that duplicate the SFT training dataset.
- Data Selection: We first use Qwen2.5-32B-Instruct to annotate each sample with domain-specific labels. For each category, we select the problems with the longest solutions, treating these as the most challenging. Our SFT model then generates responses for these selected problems, which are used in the subsequent rejection sampling step.
- Rejection Sampling: We employ Qwen2.5-32B-Instruct to judge the SFT model's responses to math and science questions, and use an internal sandbox service to verify the correctness of code-related answers. For each domain, we select the incorrect responses with the longest solutions from each category, keeping the number of samples balanced across categories. We use the solutions generated by DeepSeek-R1 directly as the positive samples and pair them with the selected incorrect responses to construct training pairs (see the sketch after this list).
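
The pair-construction step above can be read as a small filtering script: keep the verifier-rejected SFT responses, prefer the longest ones per category up to a balanced quota, and pair each with the corresponding DeepSeek-R1 solution as the preferred completion. The sketch below only illustrates that logic; the record fields (`question`, `category`, `sft_response`, `is_correct`, `r1_solution`) and the quota parameter are hypothetical names, not the actual pipeline's schema.

```python
from collections import defaultdict

def build_dpo_pairs(records, per_category_quota):
    """Illustrative sketch of the rejection-sampling pair construction.

    `records` is assumed to be a list of dicts with hypothetical keys:
    question, category, sft_response, is_correct (verifier verdict), r1_solution.
    """
    # Keep only responses the verifier judged incorrect (the rejected side).
    wrong = [r for r in records if not r["is_correct"]]

    # Group by category so the final pair set stays balanced across categories.
    by_category = defaultdict(list)
    for r in wrong:
        by_category[r["category"]].append(r)

    pairs = []
    for category, items in by_category.items():
        # Prefer the longest incorrect responses, used here as a difficulty proxy.
        items.sort(key=lambda r: len(r["sft_response"]), reverse=True)
        for r in items[:per_category_quota]:
            pairs.append({
                "prompt": r["question"],
                "chosen": r["r1_solution"],     # DeepSeek-R1 solution as positive
                "rejected": r["sft_response"],  # incorrect SFT response as negative
            })
    return pairs
```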
### Training Procedure
Alignment Algorithm: Direct Preference Optimization (DPO)

Training Hyperparameters:

Hyperparameter | Value |
---|---|
Batch Size | 16 |
Learning Rate | 5e-7 |
LR Scheduler | cosine |
Warmup Ratio | 0.1 |
Epoch | 3 |
Sequence Parallelism | 4 |
Loss | sigmoid preference loss |
Preference Beta | 0.1 |
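
The "sigmoid preference loss" with beta = 0.1 in the table corresponds to the standard DPO objective. The following is a minimal, self-contained sketch of that loss for reference; it is not the training code used for this model.

```python
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a tensor of summed log-probabilities of the chosen or
    rejected completion under the policy or the frozen reference (SFT) model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    reference_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - reference_margin)
    return -F.logsigmoid(logits).mean()
```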
## Evaluation
We evaluate InfiAlign-Qwen-7B-DPO on a range of benchmarks to assess its reasoning, problem-solving, and code generation capabilities. All metrics are reported as Pass@1 under a consistent regex-based answer extraction pipeline, adapted from LIMO.
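
To make "regex-based answer extraction" concrete, the sketch below pulls the final `\boxed{...}` answer out of a generated solution before comparing it to the reference; the actual rules in the LIMO-adapted pipeline may differ in detail.

```python
import re

def extract_boxed_answer(text):
    """Illustrative only: return the contents of the last \\boxed{...} in `text`,
    allowing one level of nested braces; returns None if no boxed answer exists."""
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1].strip() if matches else None
```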
### Benchmark Overview
- AIME24 / AIME25: American Invitational Mathematics Examination problems (Olympiad-level high school math).
- MATH500: Subset of the MATH dataset focused on complex mathematical reasoning.
- GPQA Diamond: Graduate-level, "Google-proof" multiple-choice science questions covering biology, physics, and chemistry.
- MMLU-Pro: Professional-level subset of the Massive Multitask Language Understanding benchmark.
- LiveCodeBench: Code reasoning benchmark using real-world coding problems.
### Performance Comparison (Pass@1)
Model | Initial CKPT | Data Size | AIME 2025 (avg@64) | AIME 2024 (avg@64) | MATH500 (avg@4) | GPQA Diamond (avg@8) | MMLU-Pro (pass@1) | LiveCodeBench-v5 (avg@8) | Avg. |
---|---|---|---|---|---|---|---|---|---|
Qwen2.5-7B-Instruct | Qwen2.5-7B-Base | 1M | 8.80 | 11.93 | 76.15 | 38.70 | 57.49 | 15.77 | 34.80 |
Qwen2.5-Math-7B-Instruct | Qwen2.5-7B-Math-Base | 2.5M | 6.72 | 6.67 | 82.40 | 31.12 | 43.06 | 2.68 | 28.78 |
DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-7B-Math-Base | 800K | 37.97 | 55.50* | 92.80* | 49.10* | 54.16 | 37.60* | 54.43 |
OpenThinker2-7B | Qwen2.5-7B-Instruct | 1M | 38.70* | 60.70* | 87.60* | 47.00* | 40.60* | 37.50 | 52.01 |
Light-R1-7B-DS | DeepSeek-R1-Distill-Qwen-7B | 3K | 44.30* | 59.10* | 91.35 | 49.40* | 54.95 | 38.40 | 56.25 |
InfiAlign-Qwen-7B-SFT-92K (ours) | Qwen2.5-7B-Math-Base | 92K | 43.39 | 56.46 | 92.35 | 48.48 | 53.51 | 34.05 | 54.70 |
InfiAlign-Qwen-7B-DPO-9K (ours) | InfiAlign-Qwen-7B-SFT-92K | 9K | 44.06 | 61.04 | 91.95 | 48.17 | 49.90 | 34.54 | 54.94 |
InfiAlign-Qwen-7B-SFT-165K (ours) | Qwen2.5-7B-Math-Base | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 36.20 | 57.52 |
InfiAlign-Qwen-7B-DPO-10K (ours) | InfiAlign-Qwen-7B-SFT-165K | 10K | 47.45 | 61.25 | 93.45 | 51.77 | 53.95 | 35.30 | 57.20 |
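
The avg@k notation in the table is read here as the mean pass rate over k sampled completions per problem (so avg@1 reduces to pass@1). A minimal sketch of that averaging, assuming per-problem lists of per-sample correctness flags:

```python
def avg_at_k(correct_flags_per_problem):
    """correct_flags_per_problem: one list of k booleans per problem (True means
    the sampled completion was judged correct). Returns the mean pass rate in %."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return 100.0 * sum(per_problem) / len(per_problem)
```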
## Usage
Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and generate content.
- Note: Make sure the model's output starts with "<think>\n"; otherwise it may generate empty reasoning, which reduces output quality. If you use `apply_chat_template` with `add_generation_prompt=True`, this is handled automatically, but the leading "<think>" tag may then be missing from the beginning of the decoded response.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "InfiX-ai/InfiAlign-Qwen-7B-DPO"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the chat prompt (raw string so the LaTeX backslashes are preserved)
prompt = r"Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
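
Because the chat template already appends the "<think>\n" prefix to the prompt, the decoded `response` above may begin without the opening tag. A small follow-up to the snippet above (reusing its `response` variable) re-attaches the tag when you want the complete reasoning trace:

```python
# Re-attach the leading <think> tag if it was consumed by the prompt template.
if not response.lstrip().startswith("<think>"):
    response = "<think>\n" + response
print(response)
```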
## Intended Uses
### Appropriate Use Cases
- Reasoning tasks in math, science, and code
- Chat-based AI assistants requiring structured problem-solving
- Educational and research tools focused on logic-based domains
### Out-of-Scope Uses
- High-stakes applications (e.g., legal, medical)
- Non-English or multilingual scenarios (model is primarily trained on English)
- Tasks not related to reasoning or logic-intensive domains
## Bias, Risks, and Limitations
### Bias
- English-centric training may result in underperformance on non-English tasks
- Potential propagation of stereotypes or social biases from source data
### Risks
- May produce hallucinated or incorrect outputs
- Risk of unsafe or offensive completions in adversarial contexts
- Code outputs may be syntactically correct but functionally incorrect
### Limitations
- Lacks fine-grained safety alignment beyond DPO
- Performance outside of math/code/science domains remains unverified
## Citation
```bibtex
@misc{cai2025infialignscalablesampleefficientframework,
      title={InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities},
      author={Shuo Cai and Su Lu and Qi Zhou and Kejing Yang and Zhijie Sang and Congkai Xie and Hongxia Yang},
      year={2025},
      eprint={2508.05496},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.05496},
}
```
## News
- We released the model checkpoint for InfiAlign-Qwen-7B-DPO!
- We released InfiAlign-Qwen-7B-DPO-Eval-Response! This dataset contains the detailed evaluation responses generated by our DPO model across various benchmarks.