LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling
Introduction
Large Language Models (LLMs) have demonstrated impressive reasoning abilities through chain-of-thought (CoT) approaches, particularly when fine-tuned on high-quality reasoning data from more powerful Large Reasoning Models (LRMs). However, reasoning chains distilled from LRMs often contain numerous functional elements that, while mimicking human problem-solving processes, result in unnecessarily verbose outputs.
LIMOPro introduces PIR (Perplexity-based Importance Refinement), a novel framework that systematically refines reasoning chains to optimize the balance between efficiency and effectiveness. Our approach:
- Classifies functional patterns in reasoning chains into four distinct modes: progressive reasoning and three types of functional steps (verification, multi-method validation, and error correction)
- Quantitatively measures each functional step's contribution using the PIR metric, which evaluates how the answer's perplexity changes when specific steps are removed (see the sketch below)
- Selectively removes low-importance functional steps while preserving the essential progressive reasoning chain
Models fine-tuned on PIR-optimized datasets maintain or enhance accuracy while significantly reducing response length compared to models trained on unrefined data, achieving up to 55% efficiency improvement across challenging reasoning benchmarks.
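The exact PIR formulation is given in the paper; the snippet below is only a rough, self-contained sketch of the underlying idea: score a functional step by how much the answer's perplexity changes when that step is deleted from the reasoning chain. The model name and the plain perplexity difference are illustrative assumptions, not the paper's exact setup.

```python
# Rough sketch of the PIR idea (illustrative only, not the exact implementation):
# a step matters if removing it makes the final answer harder to predict,
# i.e. raises the answer's perplexity under a reference language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # any causal LM works for illustration
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
model.eval()

def answer_perplexity(context: str, answer: str) -> float:
    """Perplexity of the answer tokens conditioned on the reasoning context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids
    ans_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # score only the answer tokens
    with torch.no_grad():
        loss = model(input_ids=input_ids, labels=labels).loss
    return torch.exp(loss).item()

def step_importance(question: str, steps: list[str], answer: str, i: int) -> float:
    """Higher value => removing step i hurts the answer more (step is important)."""
    full = question + "\n" + "\n".join(steps)
    ablated = question + "\n" + "\n".join(s for j, s in enumerate(steps) if j != i)
    return answer_perplexity(ablated, answer) - answer_perplexity(full, answer)
```

Low-scoring functional steps are the candidates for removal; the progressive reasoning steps are always kept.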
Key Features
- PIR Framework: A novel perplexity-based approach for quantifying reasoning step importance
- Reasoning Pattern Analysis: Systematic methodology to classify and understand functional elements in reasoning chains
- Efficient Fine-tuning: Create optimized training datasets that preserve reasoning quality while reducing verbosity
- Improved Inference Performance: Balance accuracy and efficiency in reasoning-enhanced LLMs
Installation
# Clone the repository
git clone https://github.com/GAIR-NLP/LIMOPro.git
cd LIMOPro
# Install dependencies
conda create -n beyondlimo python=3.10
conda activate beyondlimo
pip install -r requirements.txt
Then modify the machine-specific parameters in util/config.sh and util/config.py.
Data Directory Structure
The data directory is organized into several key subdirectories, each serving a specific purpose in the PIR (Perplexity-based Importance Refinement) framework:
original_data
This directory contains the raw, unmodified datasets that serve as the foundation for our work:
- LIMO: Original reasoning chains distilled from DeepSeek-R1
- LIMO-V2: Original reasoning chains distilled from QwQ
- S1: Original reasoning chains distilled from Gemini Flash Thinking
These datasets represent the verbose reasoning chains produced by Large Reasoning Models (LRMs) before any optimization.
structure
This directory contains the analytical components of our work:
- Step classification: Categorization of each reasoning step into the four distinct modes (progressive reasoning, verification, multi-method validation, and error correction)
- Step divisions: The segmentation of complete reasoning chains into discrete steps for analysis
- PIR scores: The calculated perplexity-based importance values for each functional step, which quantify how critical each step is to the final answer
This represents the core analytical work of identifying which steps are essential versus which can be safely removed.
pruning
This directory contains the optimized datasets after applying the PIR framework:
- Different versions of the datasets with varying pruning ratios
- Each pruned dataset represents a different efficiency-effectiveness tradeoff
- These are the refined datasets used for fine-tuning models with improved efficiency
meta
This directory contains metadata about the datasets.
Training
To ensure a fair comparison between original and PIR-refined models, all training was conducted using LLaMA-Factory, a standardized framework for fine-tuning large language models.
Since LIMO is one of our primary baselines, we used identical training scripts from the LIMO repository to ensure fair comparisons. This consistent methodology guarantees that any performance improvements observed in our experiments can be directly attributed to our PIR-refined datasets. When applying PIR to S1, we follow the training parameters reported in the S1 repository.
Training Commands
The LLaMA-Factory YAML configuration used for full-parameter SFT is shown below:
### model
model_name_or_path: Qwen/Qwen2.5-32B-Instruct
### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json
flash_attn: fa2
### dataset
dataset: <the pruned dataset>
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 64
template: qwen
### output
output_dir: <your own output path>
logging_steps: 1
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 5.0e-6
num_train_epochs: 15
lr_scheduler_type: cosine
warmup_ratio: 0.0
bf16: true
ddp_timeout: 180000000
Note: the data needs to be converted to the format required by LLaMA-Factory, and the dataset_info.json file in the LLaMA-Factory repo must be updated to register the new dataset.
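As a minimal sketch of that conversion (the input field names, file names, and dataset key below are illustrative assumptions; check the dataset_info.json schema of your installed LLaMA-Factory version):

```python
# Convert a PIR-pruned dataset into LLaMA-Factory's alpaca-style format.
# "question"/"solution" are assumed field names -- adjust them to the keys
# actually used by the files under data/pruning/.
import json

with open("data/pruning/limo_pruned.json") as f:  # hypothetical file name
    examples = json.load(f)

converted = [
    {
        "instruction": ex["question"],  # problem statement
        "input": "",
        "output": ex["solution"],       # PIR-refined reasoning chain + final answer
    }
    for ex in examples
]

with open("LLaMA-Factory/data/limo_pir.json", "w") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)

# Then register the file in LLaMA-Factory/data/dataset_info.json, e.g.:
# "limo_pir": {
#     "file_name": "limo_pir.json",
#     "columns": {"prompt": "instruction", "query": "input", "response": "output"}
# }
# and set `dataset: limo_pir` in the training YAML above.
```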
Inference
For evaluation and inference, we provide easy-to-use scripts that allow you to test models trained on both original and PIR-refined datasets.
Quick Start
# Place the dataset to be tested in inference/data/
# Run inference with a single command
bash inference/inference.sh MODEL_NAME NUM_GPUS FILE_ID_MIN FILE_ID_MAX DATA_NAME MODEL_PATH SAMPLING_TIMES
Parameters
- MODEL_NAME: The name of the model
- NUM_GPUS: Number of GPUs to use for inference
- FILE_ID_MIN: Starting file ID for batch processing
- FILE_ID_MAX: Ending file ID for batch processing
- DATA_NAME: Name of the dataset to evaluate on (e.g., "gsm8k", "aime", "amcmath", "gpqa")
- MODEL_PATH: The path to the model to be tested
- SAMPLING_TIMES: Number of samples drawn per question, used as the n when computing pass@1
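As a small illustration of how SAMPLING_TIMES relates to pass@1 (assuming the standard estimator, i.e., the fraction of correct samples per question averaged over questions; the repository's scripts may differ in implementation details):

```python
# pass@1 from n samples per question: average per-question fraction of correct samples.
def pass_at_1(correct_counts: list[int], n: int) -> float:
    """correct_counts[q] = number of correct samples (out of n) for question q."""
    return sum(c / n for c in correct_counts) / len(correct_counts)

# Example: 3 questions, 8 samples each, with 8, 4, and 0 correct samples.
print(pass_at_1([8, 4, 0], n=8))  # 0.5
```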
Example Usage
bash inference/inference.sh limo 8 0 200 gpqa_diamond /path/to/your/model 8
Output
The inference results are saved in JSON format in the inference/data/ directory.
Evaluation
We provide comprehensive evaluation scripts to assess model performance across various reasoning benchmarks. The evaluation pipeline measures accuracy and token count.
Quick Start
# Run evaluation with a single command
bash eval.sh FILE_ID_MIN FILE_ID_MAX DATA_NAME MODEL_NAME SAMPLING_TIMES
Parameters
- FILE_ID_MIN: Starting file ID for batch evaluation
- FILE_ID_MAX: Ending file ID for batch evaluation
- DATA_NAME: Name of the benchmark dataset (e.g., "gsm8k", "aime", "amcmath", "gpqa")
- MODEL_NAME: Path to the model checkpoint or HuggingFace model ID
- SAMPLING_TIMES: Number of sampling iterations for robust evaluation
Example Usage
# Evaluate the baseline model on AIME benchmark
bash eval.sh 0 30 aime limo 8
Results
ACC: accuracy (%, higher is better). TOK: average response length in tokens (lower is better). EFF: efficiency, i.e., accuracy divided by token count (e.g., 0.158 / 954 ≈ 1.66E-04 for Qwen2.5-32B-Instruct on AIME; higher is better). The -P rows are models fine-tuned on PIR-refined data; values in parentheses are changes relative to the corresponding base model.

| Model | AIME ACC ↑ | AIME TOK ↓ | AIME EFF ↑ | AMC ACC ↑ | AMC TOK ↓ | AMC EFF ↑ | GPQA Diamond ACC ↑ | GPQA Diamond TOK ↓ | GPQA Diamond EFF ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 15.8 | 954 | 1.66E-04 | 67.2 | 737 | 9.11E-04 | 47.0 | 517 | 9.08E-04 |
| R1-Distill-Qwen-32B | 69.2 | 9,311 | 7.43E-05 | 94.4 | 5,561 | 1.70E-04 | 64.7 | 5,634 | 1.15E-04 |
| QwQ | 81.7 | 12,234 | 6.68E-05 | 97.8 | 7,350 | 1.33E-04 | 70.2 | 7,483 | 9.38E-05 |
| S1-32B | 37.9 | 6,646 | 5.71E-05 | 80.9 | 4,542 | 1.78E-04 | 60.7 | 4,172 | 1.46E-04 |
| S1-32B-P | 42.1 (+4.2) | 4,716 (-29%) | 8.92E-05 (+56%) | 83.1 (+2.2) | 3,809 (-16%) | 2.18E-04 (+22%) | 61.6 (+0.9) | 2,472 (-41%) | 2.49E-04 (+71%) |
| LIMO | 56.7 | 12,497 | 4.53E-05 | 91.9 | 5,516 | 1.67E-04 | 67.2 | 7,173 | 9.36E-05 |
| LIMO-P | 63.3 (+6.6) | 10,588 (-15%) | 5.98E-05 (+32%) | 93.8 (+1.9) | 5,235 (-5%) | 1.79E-04 (+7%) | 71.2 (+4.0) | 6,969 (-3%) | 1.02E-04 (+9%) |
| LIMO-V2 | 66.3 | 13,896 | 4.77E-05 | 94.4 | 6,843 | 1.38E-04 | 70.2 | 8,035 | 8.74E-05 |
| LIMO-V2-P | 71.2 (+4.9) | 12,163 (-12%) | 5.86E-05 (+23%) | 96.6 (+2.2) | 6,348 (-7%) | 1.52E-04 (+10%) | 74.2 (+4.0) | 6,968 (-13%) | 1.07E-04 (+22%) |
Links to Our Models
- LIMO-P: 🤗 HuggingFace
- LIMO-V2-P: 🤗 HuggingFace
- S1-32B-P: 🤗 HuggingFace
Links to Our Datasets
- LIMO-P: 🤗 HuggingFace
- LIMO-V2-P: 🤗 HuggingFace
- S1-32B-P: 🤗 HuggingFace
License
This project is licensed under the MIT License - see the LICENSE file for details.