LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
Abstract
LeanK, a learning-based method, prunes unimportant key cache channels in large language models to reduce memory usage and accelerate decoding without sacrificing accuracy.
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a static channel-wise mask that satisfies a target sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
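To make the mechanism concrete, below is a minimal PyTorch sketch of decode-time attention over a K cache that stores only the channels retained by a static, per-head mask. The function name, the uniform per-head kept-channel count, and the `kept_idx` tensor are illustrative assumptions, not the paper's released implementation, which fuses this logic into a custom decoding kernel.

```python
# Sketch only: decode-time attention with a statically channel-pruned K cache.
# Assumes every head keeps the same number of channels; LeanK's learned mask
# may allocate different budgets per head.
import torch

def decode_attention_pruned_k(q, k_cache_pruned, v_cache, kept_idx, head_dim):
    """
    q:               [batch, heads, head_dim]       query of the new token
    k_cache_pruned:  [batch, heads, seq, kept_dim]  K cache stored only for kept channels
    v_cache:         [batch, heads, seq, head_dim]  V cache (full)
    kept_idx:        [heads, kept_dim]              indices of retained K channels per head
    """
    # Gather the query channels matching the retained K channels; pruned channels
    # contribute (approximately) nothing to q . k, so they are simply skipped.
    q_pruned = torch.gather(
        q, dim=-1, index=kept_idx.unsqueeze(0).expand(q.shape[0], -1, -1)
    )  # [batch, heads, kept_dim]

    scores = torch.einsum("bhd,bhsd->bhs", q_pruned, k_cache_pruned) / head_dim**0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v_cache)  # [batch, heads, head_dim]

# Example shapes: 8 heads, 4096 cached tokens, head_dim 128, ~70% of K channels pruned.
b, h, s, d, kept = 1, 8, 4096, 128, 38
out = decode_attention_pruned_k(
    torch.randn(b, h, d),
    torch.randn(b, h, s, kept),
    torch.randn(b, h, s, d),
    torch.stack([torch.randperm(d)[:kept].sort().values for _ in range(h)]),
    head_dim=d,
)
```

Because the mask is static, the pruned K cache can be allocated at the reduced width from the start, which is where the reported memory savings come from.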
Community
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. LeanK is a novel KV cache pruning method that leverages the static sparsity of QK vectors along the channel (head_dim) dimension. A static, channel-wise pruning mask is derived through a novel two-stage training process. Experimental results demonstrate up to a 70% reduction in K cache and a 16–18% reduction in V cache, with minimal impact on end-to-end model performance.
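As a rough illustration of what such a two-stage procedure could look like (a generic sketch under stated assumptions, not the authors' actual training recipe): stage one learns continuous per-channel gates that scale K during a short calibration run, with a regularizer pushing unimportant channels toward zero; stage two binarizes the gates into a static mask whose per-head kept-channel count respects the sparsity ratio and a hardware-alignment granularity, assumed here to be a multiple of 8.

```python
# Hedged sketch of a two-stage channel-mask derivation; names and the loss are illustrative.
import torch

def stage2_binarize(gates, sparsity, align=8):
    """gates: [heads, head_dim] learned importance scores -> boolean keep-mask."""
    heads, head_dim = gates.shape
    keep = int(round(head_dim * (1.0 - sparsity)))
    keep = max(align, (keep // align) * align)           # hardware-aligned kept-channel count
    idx = gates.topk(keep, dim=-1).indices               # most important channels per head
    mask = torch.zeros_like(gates, dtype=torch.bool)
    return mask.scatter(-1, idx, True)

# Stage 1 (schematic): gates multiply K channels, and the gated attention logits are
# trained to match the full logits while an L1 term shrinks unimportant channels.
heads, head_dim = 8, 128
gates = torch.nn.Parameter(torch.ones(heads, head_dim))
opt = torch.optim.Adam([gates], lr=1e-2)
for _ in range(100):
    k = torch.randn(heads, 64, head_dim)                 # stand-in for calibration K states
    q = torch.randn(heads, 64, head_dim)                 # stand-in for calibration Q states
    full = torch.einsum("htd,hsd->hts", q, k)
    gated = torch.einsum("htd,hsd->hts", q, k * gates.unsqueeze(1))
    loss = (full - gated).pow(2).mean() + 1e-3 * gates.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

mask = stage2_binarize(gates.detach(), sparsity=0.7)     # prune ~70% of K channels per head
```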
This method introduces a new direction for KV cache compression that is orthogonal to existing approaches such as token-level compression and quantization. In addition, LeanK observes that the channel-wise norm distribution of QK vectors in each attention head correlates with the head’s retrieval capacity. This insight opens up promising avenues for enhancing the long-context understanding abilities of LLMs.
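A lightweight way to probe this observation is to measure how concentrated each head's channel norms are over a long prompt. The sketch below uses illustrative names and random tensors standing in for real K states (the paper's analysis covers QK vectors; for brevity only K is shown), computing a per-head, per-channel RMS norm and a simple concentration score.

```python
# Sketch of per-head channel-norm analysis; not code from the LeanK repository.
import torch

def k_channel_norms(k_states):
    """k_states: [heads, seq, head_dim] -> [heads, head_dim] per-channel RMS norm."""
    return k_states.pow(2).mean(dim=1).sqrt()

norms = k_channel_norms(torch.randn(8, 4096, 128))
# Fraction of norm mass carried by the top 25% of channels, per head: heads with a
# highly concentrated distribution are natural candidates for aggressive channel pruning.
top_k = 128 // 4
concentration = norms.topk(top_k, dim=-1).values.sum(-1) / norms.sum(-1)
print(concentration)
```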
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling (2025)
- HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs (2025)
- CaliDrop: KV Cache Compression with Calibration (2025)
- Efficient Long-Context LLM Inference via KV Cache Clustering (2025)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction (2025)
- SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity (2025)
- CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation (2025)