Welcome to FairyR1-14B-Preview created by PKU-DS-LAB!

| Benchmark | DeepSeek-R1-Distill-Qwen-32B | DeepSeek-R1-Distill-Qwen-14B | FairyR1-14B-Preview (PKU) | FairyR1-32B (PKU) |
|---|---|---|---|---|
| AIME 2024 (Math) | 72.6 | 69.7 | 73.7 | 80.4 |
| AIME 2025 (Math) | 52.9 | 50.0 | 64.9 | 75.6 |
| LiveCodeBench (Code) | 57.1 | 53.1 | 58.8 | 67.7 |
| GPQA-Diamond (Sci-QA) | 62.1 | 59.1 | 53.2 | 60.0 |

Introduction

FairyR1-14B-Preview is a highly efficient large language model (LLM) that matches or exceeds larger models on select tasks. Built on the DeepSeek-R1-Distill-Qwen-14B base, it continues the 'distill-and-merge' pipeline used for TinyR1-32B-Preview and FairyR1-32B, combining task-focused fine-tuning with model-merging techniques to deliver competitive performance at a drastically reduced size and inference cost. This project was funded by NSFC, Grant 624B2005.

As a member of the FairyR1 series, FairyR1-14B-Preview shares the same training data and process as FairyR1-32B. We strongly recommend FairyR1-32B, which achieves math and coding performance comparable to DeepSeek-R1-671B with only about 5% of the parameters. For more details, please see the FairyR1-32B model page.
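
As a quick start, the sketch below shows one way to load the model with Hugging Face Transformers. The generation settings (bfloat16, sampling parameters, token budget) are illustrative assumptions, not official recommendations.

```python
# Minimal usage sketch (assumes transformers, torch, and a CUDA-capable GPU are available;
# sampling settings below are illustrative assumptions, not official recommendations).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PKU-DS-LAB/FairyR1-14B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```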

Model Details

The FairyR1 model represents a further exploration of our earlier work TinyR1, retaining the core “Branch-Merge Distillation” approach while introducing refinements in data processing and model architecture.

In this effort, we overhauled the distillation data pipeline: raw examples from datasets such as AIMO/NuminaMath-1.5 for mathematics and OpenThoughts-114k for code were first passed through multiple 'teacher' models to generate candidate answers. These candidates were then carefully selected, restructured, and refined, especially the chain-of-thought (CoT) traces. Subsequently, we applied multi-stage filtering, including automated correctness checks for math problems and length-based selection (2K–8K tokens for math samples, 4K–8K tokens for code samples). This yielded two focused training sets of roughly 6.6K math examples and 3.8K code examples.
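
As a rough illustration of the length-based selection step, the sketch below filters distilled samples by tokenized answer length. The record fields ("domain", "cot_answer") and the choice of tokenizer are assumptions for illustration, not the actual FairyR1 training schema.

```python
# Illustrative sketch of the length-based selection described above.
# The record fields ("domain", "cot_answer") are a hypothetical format, not the real schema.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B")

# Token-length windows per domain, as described in the text.
LENGTH_WINDOWS = {"math": (2_000, 8_000), "code": (4_000, 8_000)}

def keep_sample(record: dict) -> bool:
    """Keep a distilled sample only if its answer length falls inside the domain window."""
    lo, hi = LENGTH_WINDOWS[record["domain"]]
    n_tokens = len(tokenizer.encode(record["cot_answer"]))
    return lo <= n_tokens <= hi

samples = [
    {"domain": "math", "cot_answer": "<think> ... long reasoning trace ... </think> 42"},
]
filtered = [s for s in samples if keep_sample(s)]
```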

On the modeling side, rather than training three separate specialists as before, we limited our scope to just two domain experts (math and code), each trained independently under identical hyperparameters (e.g., learning rate and batch size) for about five epochs. We then fused these experts into a single 14B-parameter model using the AcreeFusion tool. By streamlining both the data distillation workflow and the specialist-model merging process, FairyR1 achieves task-competitive results with only a fraction of the parameters and computational cost of much larger models.
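
For intuition only, the sketch below shows a naive parameter-averaging merge of two fine-tuned experts that share a base architecture. This is not the AcreeFusion procedure actually used for FairyR1, and the checkpoint paths are placeholders.

```python
# Naive weight-averaging sketch for fusing two domain experts into one model.
# This is NOT the AcreeFusion method used for FairyR1; it only illustrates the idea
# of merging specialists fine-tuned from the same base. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM

math_expert = AutoModelForCausalLM.from_pretrained("path/to/math-expert", torch_dtype=torch.bfloat16)
code_expert = AutoModelForCausalLM.from_pretrained("path/to/code-expert", torch_dtype=torch.bfloat16)

code_state = code_expert.state_dict()
merged_state = {}
for name, math_param in math_expert.state_dict().items():
    # Equal-weight average; real merge methods weight or select parameters more carefully.
    merged_state[name] = (math_param + code_state[name]) / 2

math_expert.load_state_dict(merged_state)
math_expert.save_pretrained("merged-model-sketch")
```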

Result Analysis and Key Contributions:

From the test results, FairyR1-14B-Preview scored slightly higher than DeepSeek-R1-Distill-Qwen-32B on the AIME 2024 (Math) (73.7 vs 72.6) and LiveCodeBench (Code) (58.8 vs 57.1) benchmarks, and demonstrated notably stronger performance on AIME 2025 (Math) (64.9 vs 52.9). These results indicate that FairyR1-14B-Preview outperforms the larger DeepSeek-R1-Distill-Qwen-32B in the mathematical and programming domains.

This work demonstrates the feasibility of significantly reducing model size and potential inference cost through optimized data processing and model fusion techniques while maintaining strong task-specific performance.

Model Description

  • Developed by: PKU-DS-LAB
  • Model type: Reasoning Model
  • Language(s) (NLP): English, Chinese
  • License: apache-2.0
  • Finetuned from model: DeepSeek-R1-Distill-Qwen-14B

Training Data

Roughly 6.6K math examples (distilled from AIMO/NuminaMath-1.5) and 3.8K code examples (distilled from OpenThoughts-114k), produced via the data pipeline described in Model Details above.
Hardware Utilization

  • Hardware Type: 16 × NVIDIA-H100
  • Hours used (Math): 2.5h
  • Hours used (Coding): 1.5h
  • Model Merging: about 20 min on CPU, no GPU needed.

Evaluation Set

  • AIME 2024/2025 (math): We evaluate 32 times and report the average accuracy. AIME 2024 contains 30 problems. AIME 2025 consists of Part I and Part II, with a total of 30 questions.
  • LiveCodeBench (code): We evaluate 8 times and report the average accuracy. The dataset version is "release_v5" (date range: 2024-08-01 to 2025-02-01), consisting of 279 problems.
  • GPQA-Diamond (Sci-QA): We evaluate 8 times and report the average accuracy. The dataset consists of 198 problems.
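
The repeated-run averaging above amounts to a simple mean over per-run accuracies; the sketch below shows the computation with made-up placeholder values.

```python
# Sketch of reporting averaged accuracy over repeated evaluation runs
# (32 runs for AIME, 8 runs for LiveCodeBench and GPQA-Diamond).
# The per-run accuracies below are placeholders, not real results.
def mean_accuracy(per_run_accuracies: list[float]) -> float:
    return sum(per_run_accuracies) / len(per_run_accuracies)

aime_2024_runs = [0.733, 0.767, 0.700, 0.733]  # would contain 32 entries in practice
print(f"AIME 2024 average accuracy: {mean_accuracy(aime_2024_runs):.3f}")
```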

FairyR1 series Team Members:

Led By:

Tong Yang

Core Contributors:

Wang Li; Junting Zhou; Wenrui Liu; Yilun Yao; Rongle Wang

Model Card Contact

For more details, please contact: [email protected]

Model Size

  • 14.8B parameters (Safetensors, BF16)