LeanK: Learnable K Cache Channel Pruning for Efficient Decoding
Abstract
LeanK, a learning-based method, prunes unimportant key cache channels in large language models to reduce memory usage and accelerate decoding without sacrificing accuracy.
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a static channel-wise mask that satisfies a target sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
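To make the mechanism concrete, below is a minimal PyTorch sketch of decode-time attention over a K cache that stores only the channels retained by a static, per-head mask. The function name, the uniform per-head kept-channel count, and the `kept_idx` tensor are illustrative assumptions, not the paper's released implementation, which fuses this logic into a custom decoding kernel.

```python
# Sketch only: decode-time attention with a statically channel-pruned K cache.
# Assumes every head keeps the same number of channels; LeanK's learned mask
# may allocate different budgets per head.
import torch

def decode_attention_pruned_k(q, k_cache_pruned, v_cache, kept_idx, head_dim):
    """
    q:               [batch, heads, head_dim]       query of the new token
    k_cache_pruned:  [batch, heads, seq, kept_dim]  K cache stored only for kept channels
    v_cache:         [batch, heads, seq, head_dim]  V cache (full)
    kept_idx:        [heads, kept_dim]              indices of retained K channels per head
    """
    # Gather the query channels matching the retained K channels; pruned channels
    # contribute (approximately) nothing to q . k, so they are simply skipped.
    q_pruned = torch.gather(
        q, dim=-1, index=kept_idx.unsqueeze(0).expand(q.shape[0], -1, -1)
    )  # [batch, heads, kept_dim]

    scores = torch.einsum("bhd,bhsd->bhs", q_pruned, k_cache_pruned) / head_dim**0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("bhs,bhsd->bhd", probs, v_cache)  # [batch, heads, head_dim]

# Example shapes: 8 heads, 4096 cached tokens, head_dim 128, ~70% of K channels pruned.
b, h, s, d, kept = 1, 8, 4096, 128, 38
out = decode_attention_pruned_k(
    torch.randn(b, h, d),
    torch.randn(b, h, s, kept),
    torch.randn(b, h, s, d),
    torch.stack([torch.randperm(d)[:kept].sort().values for _ in range(h)]),
    head_dim=d,
)
```

Because the mask is static, the pruned K cache can be allocated at the reduced width from the start, which is where the reported memory savings come from.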
Community
Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. LeanK is a novel KV cache pruning method that leverages the static sparsity of QK vectors along the channel (head_dim) dimension. A static, channel-wise pruning mask is derived through a novel two-stage training process. Experimental results demonstrate up to a 70% reduction in K cache and a 16–18% reduction in V cache, with minimal impact on end-to-end model performance.
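As a rough illustration of what such a two-stage procedure could look like (a generic sketch under stated assumptions, not the authors' actual training recipe): stage one learns continuous per-channel gates that scale K during a short calibration run, with a regularizer pushing unimportant channels toward zero; stage two binarizes the gates into a static mask whose per-head kept-channel count respects the sparsity ratio and a hardware-alignment granularity, assumed here to be a multiple of 8.

```python
# Hedged sketch of a two-stage channel-mask derivation; names and the loss are illustrative.
import torch

def stage2_binarize(gates, sparsity, align=8):
    """gates: [heads, head_dim] learned importance scores -> boolean keep-mask."""
    heads, head_dim = gates.shape
    keep = int(round(head_dim * (1.0 - sparsity)))
    keep = max(align, (keep // align) * align)           # hardware-aligned kept-channel count
    idx = gates.topk(keep, dim=-1).indices               # most important channels per head
    mask = torch.zeros_like(gates, dtype=torch.bool)
    return mask.scatter(-1, idx, True)

# Stage 1 (schematic): gates multiply K channels, and the gated attention logits are
# trained to match the full logits while an L1 term shrinks unimportant channels.
heads, head_dim = 8, 128
gates = torch.nn.Parameter(torch.ones(heads, head_dim))
opt = torch.optim.Adam([gates], lr=1e-2)
for _ in range(100):
    k = torch.randn(heads, 64, head_dim)                 # stand-in for calibration K states
    q = torch.randn(heads, 64, head_dim)                 # stand-in for calibration Q states
    full = torch.einsum("htd,hsd->hts", q, k)
    gated = torch.einsum("htd,hsd->hts", q, k * gates.unsqueeze(1))
    loss = (full - gated).pow(2).mean() + 1e-3 * gates.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

mask = stage2_binarize(gates.detach(), sparsity=0.7)     # prune ~70% of K channels per head
```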
This method introduces a new direction for KV cache compression that is orthogonal to existing approaches such as token-level compression and quantization. In addition, LeanK observes that the channel-wise norm distribution of QK vectors in each attention head correlates with the head’s retrieval capacity. This insight opens up promising avenues for enhancing the long-context understanding abilities of LLMs.
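A lightweight way to probe this observation is to measure how concentrated each head's channel norms are over a long prompt. The sketch below uses illustrative names and random tensors standing in for real K states (the paper's analysis covers QK vectors; for brevity only K is shown), computing a per-head, per-channel RMS norm and a simple concentration score.

```python
# Sketch of per-head channel-norm analysis; not code from the LeanK repository.
import torch

def k_channel_norms(k_states):
    """k_states: [heads, seq, head_dim] -> [heads, head_dim] per-channel RMS norm."""
    return k_states.pow(2).mean(dim=1).sqrt()

norms = k_channel_norms(torch.randn(8, 4096, 128))
# Fraction of norm mass carried by the top 25% of channels, per head: heads with a
# highly concentrated distribution are natural candidates for aggressive channel pruning.
top_k = 128 // 4
concentration = norms.topk(top_k, dim=-1).values.sum(-1) / norms.sum(-1)
print(concentration)
```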
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- TriangleMix: A Lossless and Efficient Attention Pattern for Long Context Prefilling (2025)
- HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs (2025)
- CaliDrop: KV Cache Compression with Calibration (2025)
- Efficient Long-Context LLM Inference via KV Cache Clustering (2025)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction (2025)
- SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity (2025)
- CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation (2025)