LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
Introduction
Large Language Models (LLMs) have demonstrated impressive reasoning abilities through chain-of-thought (CoT) approaches, particularly when fine-tuned on high-quality reasoning data from more powerful Large Reasoning Models (LRMs). However, reasoning chains distilled from LRMs often contain numerous functional elements that, while mimicking human problem-solving processes, result in unnecessarily verbose outputs.
LIMOPro introduces PIR (Perplexity-based Importance Refinement), a novel framework that systematically refines reasoning chains to optimize the balance between efficiency and effectiveness. Our approach:
- Classifies functional patterns in reasoning chains into four distinct modes: progressive reasoning and three types of functional steps (verification, multi-method validation, and error correction)
- Quantitatively measures each functional step's contribution using the PIR metric, which evaluates how the answer's perplexity changes when specific steps are removed (see the sketch below)
- Selectively removes low-importance functional steps while preserving the essential progressive reasoning chain
Models fine-tuned on PIR-optimized datasets maintain or enhance accuracy while significantly reducing response length compared to models trained on unrefined data, achieving up to 55% efficiency improvement across challenging reasoning benchmarks.
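The exact PIR formulation is given in the paper; the snippet below is only a rough, self-contained sketch of the underlying idea: score a functional step by how much the answer's perplexity changes when that step is deleted from the reasoning chain. The model name and the plain perplexity difference are illustrative assumptions, not the paper's exact setup.

```python
# Rough sketch of the PIR idea (illustrative only, not the exact implementation):
# a step matters if removing it makes the final answer harder to predict,
# i.e. raises the answer's perplexity under a reference language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # any causal LM works for illustration
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
model.eval()

def answer_perplexity(context: str, answer: str) -> float:
    """Perplexity of the answer tokens conditioned on the reasoning context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ans_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # score only the answer tokens
    with torch.no_grad():
        loss = model(input_ids=input_ids, labels=labels).loss
    return torch.exp(loss).item()

def step_importance(question: str, steps: list[str], answer: str, i: int) -> float:
    """Higher value => removing step i hurts the answer more (step is important)."""
    full = question + "\n" + "\n".join(steps)
    ablated = question + "\n" + "\n".join(s for j, s in enumerate(steps) if j != i)
    return answer_perplexity(ablated, answer) - answer_perplexity(full, answer)
```

Low-scoring functional steps are the candidates for removal; the progressive reasoning steps are always kept.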
Key Features
- PIR Framework: A novel perplexity-based approach for quantifying reasoning step importance
- Reasoning Pattern Analysis: Systematic methodology to classify and understand functional elements in reasoning chains
- Efficient Fine-tuning: Create optimized training datasets that preserve reasoning quality while reducing verbosity
- Improved Inference Performance: Balance accuracy and efficiency in reasoning-enhanced LLMs
Installation
# Clone the repository
git clone https://github.com/GAIR-NLP/LIMOPro.git
cd LIMOPro
# Install dependencies
conda create -n beyondlimo python=3.10
conda activate beyondlimo
pip install -r requirements.txt
Then modify the machine-specific parameters in util/config.sh and util/config.py.
Data Directory Structure
The data directory is organized into several key subdirectories, each serving a specific purpose in the PIR (Perplexity-based Importance Refinement) framework:
original_data
This directory contains the raw, unmodified datasets that serve as the foundation for our work:
- LIMO: Original reasoning chains distilled from DeepSeek-R1
- LIMO-V2: Original reasoning chains distilled from QwQ
- S1: Original reasoning chains distilled from Gemini Flash Thinking
These datasets represent the verbose reasoning chains produced by Large Reasoning Models (LRMs) before any optimization.
structure
This directory contains the analytical components of our work:
- Step classification: Categorization of each reasoning step into the four distinct modes (progressive reasoning, verification, multi-method validation, and error correction)
- Step divisions: The segmentation of complete reasoning chains into discrete steps for analysis
- PIR scores: The calculated perplexity-based importance values for each functional step, which quantify how critical each step is to the final answer
This represents the core analytical work of identifying which steps are essential versus which can be safely removed.
pruning
This directory contains the optimized datasets after applying the PIR framework:
- Different versions of the datasets with varying pruning ratios
- Each pruned dataset represents a different efficiency-effectiveness tradeoff
- These are the refined datasets used for fine-tuning models with improved efficiency
meta
This directory contains metadata about the datasets.
Training
To ensure a fair comparison between original and PIR-refined models, all training was conducted using LLaMA-Factory, a standardized framework for fine-tuning large language models.
Since LIMO is one of our primary baselines, we used identical training scripts from the LIMO repository to ensure fair comparisons. This consistent methodology guarantees that any performance improvements observed in our experiments can be directly attributed to our PIR-refined datasets. When applying PIR to S1, we follow the training parameters reported in the S1 repository.
Training Commands
The LLaMA-Factory YAML configuration used for full-parameter SFT is shown below:
### model
model_name_or_path: Qwen/Qwen2.5-32B-Instruct
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
flash_attn: fa2
### dataset
dataset: <the pruned dataset>
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 64
template: qwen
### output
output_dir: <your own output path>
logging_steps: 1
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 15
lr_scheduler_type: cosine
warmup_ratio: 0.0
bf16: true
ddp_timeout: 180000000
Note: the data needs to be converted to the format required by LLaMA-Factory, and the dataset_info.json file in the LLaMA-Factory repo must be updated to register the new dataset.
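As a minimal sketch of that conversion (the input field names, file names, and dataset key below are illustrative assumptions; check the dataset_info.json schema of your installed LLaMA-Factory version):

```python
# Convert a PIR-pruned dataset into LLaMA-Factory's alpaca-style format.
# "question"/"solution" are assumed field names -- adjust them to the keys
# actually used by the files under data/pruning/.
import json

with open("data/pruning/limo_pruned.json") as f:  # hypothetical file name
    examples = json.load(f)

converted = [
    {
        "instruction": ex["question"],  # problem statement
        "input": "",
        "output": ex["solution"],       # PIR-refined reasoning chain + final answer
    }
    for ex in examples
]

with open("LLaMA-Factory/data/limo_pir.json", "w") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)

# Then register the file in LLaMA-Factory/data/dataset_info.json, e.g.:
# "limo_pir": {
#     "file_name": "limo_pir.json",
#     "columns": {"prompt": "instruction", "query": "input", "response": "output"}
# }
# and set `dataset: limo_pir` in the training YAML above.
```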
Inference
For evaluation and inference, we provide easy-to-use scripts that allow you to test models trained on both original and PIR-refined datasets.
Quick Start
# Place the dataset to be tested in inference/data/
# Run inference with a single command
bash inference/inference.sh MODEL_NAME NUM_GPUS FILE_ID_MIN FILE_ID_MAX DATA_NAME MODEL_PATH SAMPLING_TIMES
Parameters
- MODEL_NAME: The name of the model
- NUM_GPUS: Number of GPUs to use for inference
- FILE_ID_MIN: Starting file ID for batch processing
- FILE_ID_MAX: Ending file ID for batch processing
- DATA_NAME: Name of the dataset to evaluate on (e.g., "gsm8k", "aime", "amcmath", "gpqa")
- MODEL_PATH: The path to the model to be tested
- SAMPLING_TIMES: Number of samples drawn per question, used as the n when computing pass@1
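As a small illustration of how SAMPLING_TIMES relates to pass@1 (assuming the standard estimator, i.e., the fraction of correct samples per question averaged over questions; the repository's scripts may differ in implementation details):

```python
# pass@1 from n samples per question: average per-question fraction of correct samples.
def pass_at_1(correct_counts: list[int], n: int) -> float:
    """correct_counts[q] = number of correct samples (out of n) for question q."""
    return sum(c / n for c in correct_counts) / len(correct_counts)

# Example: 3 questions, 8 samples each, with 8, 4, and 0 correct samples.
print(pass_at_1([8, 4, 0], n=8))  # 0.5
```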
Example Usage
bash inference/inference.sh limo 8 0 200 gpqa_diamond /path/to/your/model 8
Output
The inference results are saved in JSON format in the inference/data/ directory.
Evaluation
We provide comprehensive evaluation scripts to assess model performance across various reasoning benchmarks. The evaluation pipeline measures accuracy and token count.
Quick Start
# Run evaluation with a single command
bash eval.sh FILE_ID_MIN FILE_ID_MAX DATA_NAME MODEL_NAME SAMPLING_TIMES
Parameters
- FILE_ID_MIN: Starting file ID for batch evaluation
- FILE_ID_MAX: Ending file ID for batch evaluation
- DATA_NAME: Name of the benchmark dataset (e.g., "gsm8k", "aime", "amcmath", "gpqa")
- MODEL_NAME: Path to the model checkpoint or HuggingFace model ID
- SAMPLING_TIMES: Number of sampling iterations for robust evaluation
Example Usage
# Evaluate the baseline model on AIME benchmark
bash eval.sh 0 30 aime limo 8
Results
ACC: accuracy (%, higher is better). TOK: average response length in tokens (lower is better). EFF: efficiency, i.e., accuracy divided by token count (e.g., 0.158 / 954 ≈ 1.66E-04 for Qwen2.5-32B-Instruct on AIME; higher is better). The -P rows are models fine-tuned on PIR-refined data; values in parentheses are changes relative to the corresponding base model.

| Model | AIME ACC ↑ | AIME TOK ↓ | AIME EFF ↑ | AMC ACC ↑ | AMC TOK ↓ | AMC EFF ↑ | GPQA Diamond ACC ↑ | GPQA Diamond TOK ↓ | GPQA Diamond EFF ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 15.8 | 954 | 1.66E-04 | 67.2 | 737 | 9.11E-04 | 47.0 | 517 | 9.08E-04 |
| R1-Distill-Qwen-32B | 69.2 | 9,311 | 7.43E-05 | 94.4 | 5,561 | 1.70E-04 | 64.7 | 5,634 | 1.15E-04 |
| QwQ | 81.7 | 12,234 | 6.68E-05 | 97.8 | 7,350 | 1.33E-04 | 70.2 | 7,483 | 9.38E-05 |
| S1-32B | 37.9 | 6,646 | 5.71E-05 | 80.9 | 4,542 | 1.78E-04 | 60.7 | 4,172 | 1.46E-04 |
| S1-32B-P | 42.1 (+4.2) | 4,716 (-29%) | 8.92E-05 (+56%) | 83.1 (+2.2) | 3,809 (-16%) | 2.18E-04 (+22%) | 61.6 (+0.9) | 2,472 (-41%) | 2.49E-04 (+71%) |
| LIMO | 56.7 | 12,497 | 4.53E-05 | 91.9 | 5,516 | 1.67E-04 | 67.2 | 7,173 | 9.36E-05 |
| LIMO-P | 63.3 (+6.6) | 10,588 (-15%) | 5.98E-05 (+32%) | 93.8 (+1.9) | 5,235 (-5%) | 1.79E-04 (+7%) | 71.2 (+4.0) | 6,969 (-3%) | 1.02E-04 (+9%) |
| LIMO-V2 | 66.3 | 13,896 | 4.77E-05 | 94.4 | 6,843 | 1.38E-04 | 70.2 | 8,035 | 8.74E-05 |
| LIMO-V2-P | 71.2 (+4.9) | 12,163 (-12%) | 5.86E-05 (+23%) | 96.6 (+2.2) | 6,348 (-7%) | 1.52E-04 (+10%) | 74.2 (+4.0) | 6,968 (-13%) | 1.07E-04 (+22%) |
Links to Our Models
- LIMO-P: 🤗 HuggingFace
- LIMO-V2-P: 🤗 HuggingFace
- S1-32B-P: 🤗 HuggingFace
Links to Our Datasets
- LIMO-P: 🤗 HuggingFace
- LIMO-V2-P: 🤗 HuggingFace
- S1-32B-P: 🤗 HuggingFace
License
This project is licensed under the MIT License - see the LICENSE file for details.