LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

GitHub Paper

[Figure: PIR Framework Overview]

Introduction

Large Language Models (LLMs) have demonstrated impressive reasoning abilities through chain-of-thought (CoT) approaches, particularly when fine-tuned on high-quality reasoning data from more powerful Large Reasoning Models (LRMs). However, reasoning chains distilled from LRMs often contain numerous functional elements that, while mimicking human problem-solving processes, result in unnecessarily verbose outputs.

LIMOPro introduces PIR (Perplexity-based Importance Refinement), a novel framework that systematically refines reasoning chains to optimize the balance between efficiency and effectiveness. Our approach:

  1. Classifies functional patterns in reasoning chains into four distinct modes: progressive reasoning and three types of functional steps (verification, multi-method validation, and error correction)
  2. Quantitatively measures each functional step's contribution using the PIR metric, which evaluates answer perplexity changes when specific steps are removed
  3. Selectively removes low-importance functional steps while preserving the essential progressive reasoning chain
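
The scoring idea in step 2 can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the importance of a step is the ratio between the answer perplexity with that step removed and the answer perplexity with the full chain, and it assumes the per-token log-probabilities of the answer have already been obtained from a language model. The function names `perplexity` and `importance` are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of the answer tokens: exp of the negative mean log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one answer-token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def importance(full_logprobs: Sequence[float],
               ablated_logprobs: Sequence[float]) -> float:
    """Assumed importance score (hypothetical form, not taken from the source).

    full_logprobs:    answer log-probs conditioned on the complete chain
    ablated_logprobs: answer log-probs conditioned on the chain minus one step
    A value near 1 means removing the step barely changes the answer
    perplexity; a larger value means the answer became harder to predict.
    """
    return perplexity(ablated_logprobs) / perplexity(full_logprobs)


if __name__ == "__main__":
    full = [-0.10, -0.20, -0.10]
    minus_redundant_check = [-0.11, -0.20, -0.10]
    minus_key_correction = [-1.00, -1.20, -0.80]
    print(importance(full, minus_redundant_check))
    print(importance(full, minus_key_correction))
```

Under this assumed definition, a functional step whose score stays close to 1 is a candidate for removal, while a step whose removal sharply raises the answer perplexity is kept.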

Models fine-tuned on PIR-optimized datasets maintain or enhance accuracy while significantly reducing response length compared to models trained on unrefined data, achieving up to 55% efficiency improvement across challenging reasoning benchmarks.

Key Features

  • PIR Framework: A novel perplexity-based approach for quantifying reasoning step importance
  • Reasoning Pattern Analysis: Systematic methodology to classify and understand functional elements in reasoning chains
  • Efficient Fine-tuning: Create optimized training datasets that preserve reasoning quality while reducing verbosity
  • Improved Inference Performance: Balance accuracy and efficiency in reasoning-enhanced LLMs

Installation

# Clone the repository
git clone https://github.com/GAIR-NLP/LIMOPro.git
cd LIMOPro

# Install dependencies
conda create -n beyondlimo python=3.10
conda activate beyondlimo
pip install -r requirements.txt

Then adjust the parameters for your machine in the util/config.sh and util/config.py files.

Data Directory Structure

The data directory is organized into several key subdirectories, each serving a specific purpose in the PIR (Perplexity-based Importance Refinement) framework:

original_data

This directory contains the raw, unmodified datasets that serve as the foundation for our work:

  • LIMO: Original reasoning chains distilled from DeepSeek-R1
  • LIMO-V2: Original reasoning chains distilled from QwQ
  • S1: Original reasoning chains distilled from Gemini Flash Thinking

These datasets represent the verbose reasoning chains produced by Large Reasoning Models (LRMs) before any optimization.

structure

This directory contains the analytical components of our work:

  • Step classification: Categorization of each reasoning step into the four distinct modes (progressive reasoning, verification, multi-method validation, and error correction)
  • Step divisions: The segmentation of complete reasoning chains into discrete steps for analysis
  • PIR scores: The calculated perplexity-based importance values for each functional step, which quantify how critical each step is to the final answer

This represents the core analytical work of identifying which steps are essential versus which can be safely removed.

pruning

This directory contains the optimized datasets after applying the PIR framework:

  • Different versions of the datasets with varying pruning ratios
  • Each pruned dataset represents a different efficiency-effectiveness tradeoff
  • These are the refined datasets used for fine-tuning models with improved efficiency
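
A pruned variant of one reasoning chain could be produced roughly as sketched below. The `Step` record, the mode labels, and the interpretation of the pruning ratio (the fraction of functional steps removed within a single chain, lowest score first) are assumptions made for illustration; the files in this directory may use a different schema and a different ratio definition.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label for the steps that must always be kept.
PROGRESSIVE = "progressive_reasoning"


@dataclass
class Step:
    text: str
    mode: str           # e.g. PROGRESSIVE, "verification", "error_correction"
    score: float = 0.0  # importance score; unused for progressive steps


def prune_chain(steps: List[Step], ratio: float) -> List[Step]:
    """Drop the lowest-scoring functional steps, keeping the original order.

    ratio is the assumed fraction of functional steps to remove (0.0 to 1.0).
    Progressive reasoning steps are never removed.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    functional = [i for i, s in enumerate(steps) if s.mode != PROGRESSIVE]
    n_remove = int(len(functional) * ratio)
    lowest = sorted(functional, key=lambda i: steps[i].score)[:n_remove]
    drop = set(lowest)
    return [s for i, s in enumerate(steps) if i not in drop]


if __name__ == "__main__":
    chain = [
        Step("Set up the equation.", PROGRESSIVE),
        Step("Double-check the setup.", "verification", 1.01),
        Step("Solve for x.", PROGRESSIVE),
        Step("Wait, the sign was wrong; redo it.", "error_correction", 2.50),
    ]
    for step in prune_chain(chain, ratio=0.5):
        print(step.text)
```

Sweeping `ratio` over several values yields the family of datasets described above, each trading some functional steps for shorter chains.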

meta

This directory contains metadata about the datasets.

Training

To ensure a fair comparison between original and PIR-refined models, all training was conducted using LLaMA-Factory, a standardized framework for fine-tuning large language models.

Since LIMO is one of our primary baselines, we used identical training scripts from the LIMO repository to ensure fair comparisons. This consistent methodology means that performance differences observed in our experiments can be attributed to our PIR-refined datasets rather than to training settings. When applying PIR to S1, we follow the same training parameters as reported in the S1 repository.

Training Commands

### model
model_name_or_path: Qwen/Qwen2.5-32B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
flash_attn: fa2

### dataset
dataset: <the pruned dataset>
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 64
template: qwen

### output
output_dir: <your own output path>
logging_steps: 1
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 15
lr_scheduler_type: cosine
warmup_ratio: 0.0
bf16: true
ddp_timeout: 180000000

Note: The data must first be converted to the format required by LLaMA-Factory, and the dataset_info.json file in the LLaMA-Factory repo must be updated to register the converted dataset.
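
One possible conversion is sketched below, assuming LLaMA-Factory's alpaca-style format (records with `instruction`, `input`, and `output` fields) and a `dataset_info.json` entry that points at the converted file through `file_name`. The source field names `question` and `solution`, the dataset name `limo_pruned`, and the file name are hypothetical; adjust them to match the pruned dataset you are using.

```python
import json
from typing import Dict, Iterable, List


def to_alpaca(records: Iterable[Dict[str, str]],
              question_key: str = "question",
              response_key: str = "solution") -> List[Dict[str, str]]:
    """Map each record to an alpaca-style example.

    question_key and response_key are assumed field names, not confirmed
    by the repository.
    """
    return [
        {"instruction": r[question_key], "input": "", "output": r[response_key]}
        for r in records
    ]


def dataset_info_entry(name: str, file_name: str) -> Dict[str, Dict[str, str]]:
    """Entry to merge into LLaMA-Factory's dataset_info.json."""
    return {name: {"file_name": file_name}}


if __name__ == "__main__":
    pruned = [{"question": "What is 2 + 2?", "solution": "2 + 2 = 4. The answer is 4."}]
    # Write the converted examples where LLaMA-Factory can find them.
    with open("limo_pruned.json", "w", encoding="utf-8") as f:
        json.dump(to_alpaca(pruned), f, ensure_ascii=False, indent=2)
    # Print the entry to add to dataset_info.json by hand.
    print(json.dumps(dataset_info_entry("limo_pruned", "limo_pruned.json"), indent=2))
```

The name used for the entry (here `limo_pruned`) is the value that would go in the `dataset:` field of the training configuration.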

Inference

For evaluation and inference, we provide easy-to-use scripts that allow you to test models trained on both original and PIR-refined datasets.

Quick Start

# Place the dataset to be tested in inference/data/
# Run inference with a single command
bash inference/inference.sh MODEL_NAME NUM_GPUS FILE_ID_MIN FILE_ID_MAX DATA_NAME MODEL_PATH SAMPLING_TIMES

Parameters

  • MODEL_NAME: The name of the model
  • NUM_GPUS: Number of GPUs to use for inference
  • FILE_ID_MIN: Starting file ID for batch processing
  • FILE_ID_MAX: Ending file ID for batch processing
  • DATA_NAME: Name of the dataset to evaluate on (e.g., "gsm8k", "aime", "amcmath", "gpqa")
  • MODEL_PATH: The path to your model to be tested
  • SAMPLING_TIMES: Number of samples generated per question; used as n when calculating pass@1.
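
For reference, pass@1 over n samples per question reduces to the fraction of correct samples for each question, averaged over all questions. The sketch below shows that calculation; the function name and input layout are hypothetical, and the repository's scripts may aggregate results differently.

```python
from typing import Dict, List


def pass_at_1(correct: Dict[str, List[bool]]) -> float:
    """Average per-question fraction of correct samples.

    correct maps a question ID to one boolean per sampled response
    (n = SAMPLING_TIMES entries per question).
    """
    if not correct:
        raise ValueError("no questions given")
    per_question = []
    for question_id, flags in correct.items():
        if not flags:
            raise ValueError(f"no samples for question {question_id}")
        per_question.append(sum(flags) / len(flags))
    return sum(per_question) / len(per_question)


if __name__ == "__main__":
    results = {
        "q1": [True] * 8,                 # solved in all 8 samples
        "q2": [True, False] * 4,          # solved in 4 of 8 samples
        "q3": [False] * 8,                # never solved
    }
    print(pass_at_1(results))
```

Sampling several times per question reduces the variance of the estimate on small benchmarks compared with a single greedy run.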

Example Usage

bash inference/inference.sh limo 8 0 200 gpqa_diamond /path/to/your/model 8

Output

The inference results are saved in JSON format in inference/data/.

Evaluation

We provide comprehensive evaluation scripts to assess model performance across various reasoning benchmarks. The evaluation pipeline measures accuracy and token count.

Quick Start

# Run evaluation with a single command
bash eval.sh FILE_ID_MIN FILE_ID_MAX DATA_NAME MODEL_NAME SAMPLING_TIMES

Parameters

  • FILE_ID_MIN: Starting file ID for batch evaluation
  • FILE_ID_MAX: Ending file ID for batch evaluation
  • DATA_NAME: Name of the benchmark dataset (e.g., "gsm8k", "aime", "amcmath", "gpqa")
  • MODEL_NAME: Path to the model checkpoint or HuggingFace model ID
  • SAMPLING_TIMES: Number of sampling iterations for robust evaluation

Example Usage

# Evaluate the baseline model on AIME benchmark
bash eval.sh 0 30 aime limo 8

Results

| Model | AIME ACC ↑ | AIME TOK ↓ | AIME EFF ↑ | AMC ACC ↑ | AMC TOK ↓ | AMC EFF ↑ | GPQA Diamond ACC ↑ | GPQA Diamond TOK ↓ | GPQA Diamond EFF ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 15.8 | 954 | 1.66E-04 | 67.2 | 737 | 9.11E-04 | 47.0 | 517 | 9.08E-04 |
| R1-Distill-Qwen-32B | 69.2 | 9,311 | 7.43E-05 | 94.4 | 5,561 | 1.70E-04 | 64.7 | 5,634 | 1.15E-04 |
| QwQ | 81.7 | 12,234 | 6.68E-05 | 97.8 | 7,350 | 1.33E-04 | 70.2 | 7,483 | 9.38E-05 |
| S1-32B | 37.9 | 6,646 | 5.71E-05 | 80.9 | 4,542 | 1.78E-04 | 60.7 | 4,172 | 1.46E-04 |
| S1-32B-P | 42.1 (+4.2) | 4,716 (-29%) | 8.92E-05 (+56%) | 83.1 (+2.2) | 3,809 (-16%) | 2.18E-04 (+22%) | 61.6 (+0.9) | 2,472 (-41%) | 2.49E-04 (+71%) |
| LIMO | 56.7 | 12,497 | 4.53E-05 | 91.9 | 5,516 | 1.67E-04 | 67.2 | 7,173 | 9.36E-05 |
| LIMO-P | 63.3 (+6.6) | 10,588 (-15%) | 5.98E-05 (+32%) | 93.8 (+1.9) | 5,235 (-5%) | 1.79E-04 (+7%) | 71.2 (+4) | 6,969 (-3%) | 1.02E-04 (+9%) |
| LIMO-V2 | 66.3 | 13,896 | 4.77E-05 | 94.4 | 6,843 | 1.38E-04 | 70.2 | 8,035 | 8.74E-05 |
| LIMO-V2-P | 71.2 (+4.9) | 12,163 (-12%) | 5.86E-05 (+23%) | 96.6 (+2.2) | 6,348 (-7%) | 1.52E-04 (+10%) | 74.2 (+3) | 6,968 (-13%) | 1.07E-04 (+22%) |
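
The results do not spell out how EFF is computed. The reported values appear consistent with accuracy expressed as a fraction divided by the average token count, which is the assumption behind the sketch below; a few entries differ in the last digit, presumably because the published ACC and TOK figures are rounded. The function names are hypothetical.

```python
def efficiency(acc_percent: float, avg_tokens: float) -> float:
    """Assumed EFF definition: accuracy as a fraction per generated token."""
    if avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return (acc_percent / 100.0) / avg_tokens


def relative_change(new: float, old: float) -> float:
    """Signed relative change of new with respect to old (e.g. -0.29 for -29%)."""
    return (new - old) / old


if __name__ == "__main__":
    # R1-Distill-Qwen-32B on AIME: ACC 69.2, TOK 9,311 (reported EFF 7.43E-05)
    print(f"{efficiency(69.2, 9311):.2E}")
    # S1-32B-P vs. S1-32B on AIME: TOK 4,716 vs. 6,646 (reported -29%)
    print(f"{relative_change(4716, 6646):+.0%}")
```

Read this way, EFF rewards a model both for answering correctly and for doing so in fewer tokens, which is why the refined (-P) models improve on it even when the accuracy gain is small.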

link to our model

link to our dataset

License

This project is licensed under the MIT License - see the LICENSE file for details.
