πŸ€– Model Card: InfiX-ai/InfiAlign-Qwen-7B-DPO

arXiv Paper Β· Hugging Face Paper Β· Hugging Face SFT Model Β· Hugging Face DPO Model Β· GitHub Repository

InfiAlign is a scalable and data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.

At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources.

When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks.

Additional improvements are obtained through the application of Direct Preference Optimization (DPO), with particularly notable gains in mathematical reasoning tasks. The model achieves an average improvement of 3.89% on AIME 24/25 benchmarks.

πŸš€ InfiAlign Model Series

The InfiAlign framework offers multiple variants tailored for different alignment strategies:

  • InfiAlign-Qwen-7B-SFT: the supervised fine-tuning stage, trained on curated reasoning data.
  • InfiAlign-Qwen-7B-DPO: this model, further aligned from the SFT checkpoint with Direct Preference Optimization.

πŸ“‹ Model Description

  • Model Name: InfiAlign-Qwen-7B-DPO
  • Developed by: InfiX-ai
  • Fine-tuned from: InfiAlign-Qwen-7B-SFT
  • Model Type: 7B-parameter decoder-only Transformer
  • Context Length: 32K tokens
  • License: Apache 2.0
  • Status: Static checkpoint (offline training)

πŸ‹οΈ Training Details

πŸ“Š Dataset Overview

Total of 10K curated samples across three core reasoning domains:

| Domain      | Curated Samples |
|-------------|-----------------|
| Mathematics | 3.5K            |
| Code        | 3.5K            |
| Science     | 3K              |

Each sample includes preference-ranked completions distilled from stronger teacher models, selected for difficulty and diversity.

Data Sources: OpenMathReasoning, Mixture-of-Thoughts, OpenScience

πŸ“Š Data Pipeline

  • Data Decontamination and Deduplication: We decontaminate the data against our evaluation benchmarks and remove samples duplicated from the SFT training dataset.
  • Data Selection: We first use Qwen2.5-32B-Instruct to annotate each sample with domain-specific labels. Within each category, we select the problems with the longest reference solutions, treating solution length as a proxy for difficulty. Our SFT model then generates responses to these selected problems, which are used in the subsequent rejection sampling step.
  • Rejection Sampling: We employ Qwen2.5-32B-Instruct to judge the SFT model's responses to math and science questions, and use an internal sandbox service to verify the correctness of code-related answers. For each domain, we keep the incorrect responses with the longest solutions from each category, keeping the number of samples balanced across categories. We use the solutions generated by DeepSeek-R1 directly as the positive (chosen) samples and pair them with the selected incorrect responses to construct preference pairs; a minimal sketch of this pairing step follows this list.
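
The exact pairing code is not released; the following is a minimal sketch only, under the assumption that each candidate carries a category label, a correctness verdict from the judge or sandbox, and a DeepSeek-R1 reference solution. Field names such as is_correct and r1_solution are illustrative, not the actual schema.

from dataclasses import dataclass

@dataclass
class Candidate:
    prompt: str          # original problem statement
    category: str        # domain-specific label from Qwen2.5-32B-Instruct
    sft_response: str    # response sampled from the SFT model
    is_correct: bool     # judge / sandbox verdict (hypothetical field name)
    r1_solution: str     # DeepSeek-R1 solution used as the chosen answer

def build_preference_pairs(candidates, per_category):
    """Pair the longest incorrect SFT responses with R1 solutions, balanced per category."""
    by_category = {}
    for c in candidates:
        if not c.is_correct:                      # rejection sampling keeps only failures
            by_category.setdefault(c.category, []).append(c)

    pairs = []
    for category, items in by_category.items():
        # Longest rejected responses are treated as the hardest examples.
        items.sort(key=lambda c: len(c.sft_response), reverse=True)
        for c in items[:per_category]:            # balanced sample count across categories
            pairs.append({
                "prompt": c.prompt,
                "chosen": c.r1_solution,          # positive sample from DeepSeek-R1
                "rejected": c.sft_response,       # negative sample from the SFT model
            })
    return pairs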

πŸ—οΈ Training Procedure

🧠 Alignment Algorithm: Direct Preference Optimization (DPO)
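
For reference, the "sigmoid preference loss" listed below is the standard DPO objective, with the preference strength controlled by the beta parameter:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ is the chosen DeepSeek-R1 solution, $y_l$ is the rejected SFT response, and $\pi_{\mathrm{ref}}$ is the frozen SFT checkpoint.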

βš™οΈ Training Hyperparameters:

| Hyperparameter       | Value                   |
|----------------------|-------------------------|
| Batch Size           | 16                      |
| Learning Rate        | 5e-7                    |
| LR Scheduler         | cosine                  |
| Warmup Ratio         | 0.1                     |
| Epochs               | 3                       |
| Sequence Parallelism | 4                       |
| Loss                 | sigmoid preference loss |
| Preference Beta      | 0.1                     |
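
The card does not state which training framework was used. Purely as a sketch, the hyperparameters above map onto TRL's DPOConfig roughly as follows; the output path is a placeholder, sequence parallelism is an infrastructure setting with no direct equivalent here, and global batching details may differ.

from trl import DPOConfig

# Hedged mapping of the reported hyperparameters onto TRL's DPOConfig.
# The actual training stack used for InfiAlign is not disclosed in this card.
dpo_args = DPOConfig(
    output_dir="infialign-qwen-7b-dpo",   # placeholder path
    per_device_train_batch_size=16,       # "Batch Size" (global batch may differ)
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=3,
    beta=0.1,                             # preference beta
    loss_type="sigmoid",                  # sigmoid preference loss
)
# dpo_args would then be passed to trl.DPOTrainer together with the
# InfiAlign-Qwen-7B-SFT checkpoint and the 10K preference pairs.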

πŸ“Š Evaluation

We evaluate InfiAlign-Qwen-7B-DPO on a range of benchmarks to assess its reasoning, problem-solving, and code generation capabilities. All metrics are reported as Pass@1 under a consistent regex-based answer extraction pipeline, adapted from LIMO.
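
The exact extraction rules follow the LIMO-derived pipeline and are not reproduced here. As a rough illustration of the kind of regex-based extraction involved, the snippet below pulls the final \boxed{...} answer from a model response; the pattern is an assumption, not the actual pipeline.

import re

def extract_boxed_answer(response: str):
    """Return the contents of the last \\boxed{...} span in a response, or None."""
    # Non-greedy match on simple (non-nested) \boxed{...} spans; a full pipeline
    # also needs to handle nested braces and alternative answer formats.
    matches = re.findall(r"\\boxed\{(.*?)\}", response)
    return matches[-1] if matches else None

print(extract_boxed_answer(r"... therefore the final answer is \boxed{42}."))  # -> 42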

πŸ§ͺ Benchmark Overview

  • AIME24 / AIME25: American Invitational Mathematics Examination problems (Olympiad-level high school math).
  • MATH500: Subset of the MATH dataset focused on complex mathematical reasoning.
  • GPQA (Graduate-Level Google-Proof Q&A): Challenging graduate-level multiple-choice questions in biology, physics, and chemistry; we evaluate on the Diamond subset.
  • MMLU-Pro: Professional-level subset of the Massive Multitask Language Understanding benchmark.
  • LiveCodeBench: Code reasoning benchmark using real-world coding problems.

πŸ† Performance Comparison (Pass@1)

| Model | Initial CKPT | Data Size | AIME 2025 (avg@64) | AIME 2024 (avg@64) | MATH500 (avg@4) | GPQA Diamond (avg@8) | MMLU-Pro (pass@1) | LiveCodeBench-v5 (avg@8) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | Qwen2.5-7B-Base | 1M | 8.80 | 11.93 | 76.15 | 38.70 | 57.49 | 15.77 | 34.80 |
| Qwen2.5-Math-7B-Instruct | Qwen2.5-7B-Math-Base | 2.5M | 6.72 | 6.67 | 82.40 | 31.12 | 43.06 | 2.68 | 28.78 |
| DeepSeek-Distill-Qwen-7B | Qwen2.5-7B-Math-Base | 800K | 37.97 | 55.50* | 92.80* | 49.10* | 54.16 | 37.60* | 54.43 |
| OpenThinker2-7B | Qwen2.5-7B-Instruct | 1M | 38.70* | 60.70* | 87.60* | 47.00* | 40.60* | 37.50 | 52.01 |
| Light-R1-7B-DS | DeepSeek-Distill-Qwen-7B | 3K | 44.30* | 59.10* | 91.35 | 49.40* | 54.95 | 38.40 | 56.25 |
| InfiAlign-Qwen-7B-SFT-92K (ours) | Qwen2.5-7B-Math-Base | 92K | 43.39 | 56.46 | 92.35 | 48.48 | 53.51 | 34.05 | 54.70 |
| InfiAlign-Qwen-7B-DPO-9K (ours) | InfiAlign-Qwen-7B-SFT-92K | 9K | 44.06 | 61.04 | 91.95 | 48.17 | 49.90 | 34.54 | 54.94 |
| InfiAlign-Qwen-7B-SFT-165K (ours) πŸ€— | Qwen2.5-7B-Math-Base | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 36.20 | 57.52 |
| InfiAlign-Qwen-7B-DPO-10K (ours) πŸ€— | InfiAlign-Qwen-7B-SFT-165K | 10K | 47.45 | 61.25 | 93.45 | 51.77 | 53.95 | 35.30 | 57.20 |

πŸ§ͺ Usage

Here is a code snippet using apply_chat_template that shows how to load the tokenizer and model and how to generate a response.

  • Note: Make sure the model's output begins with "<think>\n"; otherwise it may generate empty reasoning, which reduces output quality. If you call apply_chat_template with add_generation_prompt=True, this is handled automatically, but the opening "<think>" tag then belongs to the prompt rather than the generation, so it may be missing from the decoded response (a small post-processing step shown after the example can restore it).

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "InfiX-ai/InfiAlign-Qwen-7B-DPO"

# Load the model weights and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use a raw string so LaTeX escapes such as \theta are not interpreted as Python escapes.
prompt = r"Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"
messages = [
    {"role": "user", "content": prompt}
]
# add_generation_prompt=True appends the assistant turn, including the opening "<think>\n".
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
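
If downstream parsing expects the response to contain the opening tag, a small post-processing step (our suggestion, not part of the official snippet) can restore it:

# Optional: restore the opening tag if the chat template consumed it.
if not response.lstrip().startswith("<think>"):
    response = "<think>\n" + response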

🎯 Intended Uses

βœ… Appropriate Use Cases

  • Reasoning tasks in math, science, and code
  • Chat-based AI assistants requiring structured problem-solving
  • Educational and research tools focused on logic-based domains

❌ Out-of-Scope Uses

  • High-stakes applications (e.g., legal, medical)
  • Non-English or multilingual scenarios (model is primarily trained on English)
  • Tasks not related to reasoning or logic-intensive domains

βš–οΈ Bias, Risks, and Limitations

🎭 Bias

  • English-centric training may result in underperformance on non-English tasks
  • Potential propagation of stereotypes or social biases from source data

⚠️ Risks

  • May produce hallucinated or incorrect outputs
  • Risk of unsafe or offensive completions in adversarial contexts
  • Code outputs may be syntactically correct but functionally incorrect

🚧 Limitations

  • Lacks fine-grained safety alignment beyond DPO
  • Performance outside of math/code/science domains remains unverified

πŸ“š Citation

@misc{cai2025infialignscalablesampleefficientframework,
      title={InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities}, 
      author={Shuo Cai and Su Lu and Qi Zhou and Kejing Yang and Zhijie Sang and Congkai Xie and Hongxia Yang},
      year={2025},
      eprint={2508.05496},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.05496}, 
}

πŸ“Œ News

  • βœ… We released the model checkpoint for InfiAlign-Qwen-7B-DPO!
  • βœ… We released InfiAlign-Qwen-7B-DPO-Eval-Response, a dataset containing the detailed evaluation responses generated by our DPO model across various benchmarks.