Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
Abstract
Reasoning Path Compression improves inference throughput of reasoning LLMs by exploiting semantic sparsity in reasoning paths without significantly reducing accuracy.
Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective for problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining the KV cache entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves generation throughput of QwQ-32B by up to 1.60× compared to inference with a full KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
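To make the compression step concrete, below is a minimal sketch of what an RPC-style KV cache compression could look like in PyTorch. This is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the function name `compress_kv_cache`, the `selector_window` and `keep_ratio` parameters, the use of recent keys as a stand-in for the selector-window queries, and the head/window aggregation rule are all hypothetical choices.

```python
# Sketch of an RPC-style KV cache compression step (illustrative only).
import torch

def compress_kv_cache(keys, values, selector_window=32, keep_ratio=0.25):
    """
    keys, values: [num_heads, seq_len, head_dim] cached K/V for one layer.
    Returns compressed keys/values that retain the cached entries scored as
    most important by the most recent `selector_window` positions.
    """
    num_heads, seq_len, head_dim = keys.shape
    budget = max(int(seq_len * keep_ratio), selector_window)
    if seq_len <= budget:
        return keys, values  # nothing to compress yet

    # Use the most recent keys as a stand-in for the selector-window queries
    # (an assumption; the paper scores importance with recently generated queries).
    selector_q = keys[:, -selector_window:, :]                   # [H, W, D]
    attn = torch.einsum("hwd,hsd->hws", selector_q, keys)        # [H, W, S]
    attn = torch.softmax(attn / head_dim ** 0.5, dim=-1)

    # Importance of each cached position: attention mass it receives from the
    # selector window, accumulated over heads and window positions.
    importance = attn.sum(dim=(0, 1))                            # [S]
    importance[-selector_window:] = float("inf")  # always keep recent tokens

    keep_idx = importance.topk(budget).indices.sort().values     # preserve order
    return keys[:, keep_idx, :], values[:, keep_idx, :]
```

In this sketch the compression would be invoked periodically during decoding (e.g., every fixed number of generated tokens), replacing each layer's cached keys and values with their compressed counterparts so that memory stays bounded while recent context is always preserved.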
Community
Reasoning Path Compression (RPC) is a training-free method for accelerating inference of reasoning language models by leveraging the semantic sparsity of generated reasoning paths. It improves throughput and reduces memory usage with minimal accuracy drop.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning (2025)
- Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning (2025)
- Fractured Chain-of-Thought Reasoning (2025)
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models (2025)
- When Reasoning Meets Compression: Benchmarking Compressed Large Reasoning Models on Complex Reasoning Tasks (2025)
- Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models (2025)
- Efficient Reasoning Models: A Survey (2025)