Compressing Chain-of-Thought in LLMs via Step Entropy
Abstract
A novel CoT compression framework using step entropy and a two-stage training strategy enhances LLM inference efficiency without significantly reducing accuracy.
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps in order to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with only minor degradation in final answer accuracy across DeepSeek-R1-7B, 14B, and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.
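The page does not include an implementation, but the core idea lends itself to a short illustration. The sketch below (plain PyTorch) assumes that a step's entropy is obtained by summing token-level Shannon entropies over the tokens of that step, and that compression keeps only the highest-entropy steps in their original order; the function names, the sum aggregator, and the 20% keep ratio are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each position.

    logits: [seq_len, vocab_size] from a single forward pass over the CoT.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)  # [seq_len]

def step_entropies(entropies: torch.Tensor,
                   step_spans: list[tuple[int, int]]) -> list[float]:
    """Aggregate token-level entropy over each reasoning step.

    step_spans: (start, end) token indices of each step, e.g. obtained by
    splitting the CoT on step delimiters. Summation is an assumption here.
    """
    return [entropies[s:e].sum().item() for s, e in step_spans]

def prune_low_entropy_steps(steps: list[str],
                            scores: list[float],
                            keep_ratio: float = 0.2) -> list[str]:
    """Drop the lowest-entropy steps (the paper prunes up to ~80% of them),
    keeping the highest-entropy steps in their original order."""
    k = max(1, round(keep_ratio * len(steps)))
    ranked = sorted(range(len(steps)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(ranked[:k])
    return [steps[i] for i in keep_idx]
```

In this reading, the per-step score is computed offline from a single forward pass over the full CoT, so the pruning decision adds no extra generation cost; only the two-stage training described below is needed to make the model produce the compressed trace directly.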
Community
Researchers introduce a novel method to compress verbose Chain-of-Thought (CoT) reasoning in Large Language Models by identifying and pruning redundant steps using "step entropy", achieving 35-57% token reduction while maintaining accuracy.
Key Contributions:
Step Entropy Metric: A principled way to measure the informational contribution of individual reasoning steps by aggregating token-level entropy during generation.
Surprising Finding: Up to 80% of low-entropy reasoning steps can be removed with only minor accuracy degradation, while high-entropy steps are crucial and cannot be pruned.
Practical Impact: Achieves substantial efficiency gains across multiple models (DeepSeek-R1: 29.7-43.5% token reduction; Qwen3-8B: 16.2-44.9% token reduction) while maintaining or improving accuracy on mathematical reasoning benchmarks.
Two-Stage Training: Combines Supervised Fine-Tuning with reinforcement learning (GRPO) to teach models to autonomously generate compressed reasoning during inference using [SKIP] tokens (a rough sketch of such targets follows this list).
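As a purely illustrative sketch of the training-target side, the snippet below shows one way pruned steps could be replaced with a [SKIP] marker when building SFT targets. The exact target format used in the paper (for example, whether consecutive pruned steps collapse into a single [SKIP]) is not specified on this page, so the helper name and every formatting choice here are assumptions.

```python
def build_skip_target(steps: list[str],
                      keep_idx: set[int],
                      skip_token: str = "[SKIP]") -> str:
    """Build an SFT target in which each run of pruned (low-entropy) steps
    is replaced by a single skip marker, preserving the kept steps' order.

    steps:    all original reasoning steps of the CoT.
    keep_idx: indices of the high-entropy steps retained after pruning.
    """
    out, prev_skipped = [], False
    for i, step in enumerate(steps):
        if i in keep_idx:
            out.append(step)
            prev_skipped = False
        elif not prev_skipped:
            out.append(skip_token)  # collapse consecutive pruned steps (assumption)
            prev_skipped = True
    return "\n\n".join(out)
```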
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal (2025)
- SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought (2025)
- Optimizing Length Compression in Large Reasoning Models (2025)
- Reconsidering Overthinking: Penalizing Internal and External Redundancy in CoT Reasoning (2025)
- Think Clearly: Improving Reasoning via Redundant Token Pruning (2025)
- Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement (2025)
- SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning (2025)