KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
Abstract
KV Packet is a cache reuse framework that eliminates recomputation overhead in large language models by treating cached documents as immutable packets with trainable soft-token adapters.
Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to reduce inference latency. Standard KV caches, however, are context-dependent: reusing a cached document in a new context requires recomputing its KV states to account for shifts in the attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this by selectively recomputing a subset of tokens, but they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable "packets" wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. Experiments on Llama-3.1 and Qwen2.5 demonstrate that KV Packet achieves near-zero recomputation FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to the full-recomputation baseline.
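To make the idea concrete, below is a minimal sketch of how a soft-token adapter around a frozen KV packet could be structured, based only on the abstract's description. The names (`PacketAdapter`, `n_soft`, `wrap`) and the prefix/suffix layout are illustrative assumptions, not the authors' API: the key point is that the cached document's KV states stay immutable, and only the handful of adapter tokens need KV states computed when the packet is placed into a new context.

```python
# Illustrative sketch (not the authors' code): trainable soft tokens that
# "wrap" an immutable cached document so it can be reused in a new context
# without recomputing the document's own KV states.
import torch
import torch.nn as nn

class PacketAdapter(nn.Module):
    """Hypothetical soft-token adapter bridging a frozen KV packet into a new context."""
    def __init__(self, n_soft: int, d_model: int):
        super().__init__()
        # Soft-token embeddings; per the abstract, these would be trained via
        # self-supervised distillation against the full-recomputation teacher.
        self.prefix = nn.Parameter(torch.randn(n_soft, d_model) * 0.02)
        self.suffix = nn.Parameter(torch.randn(n_soft, d_model) * 0.02)

    def wrap(self, doc_embeds: torch.Tensor) -> torch.Tensor:
        # doc_embeds: (doc_len, d_model) embeddings of the cached document.
        # Only the 2 * n_soft adapter tokens incur fresh KV computation at
        # assembly time, consistent with the near-zero-FLOPs claim.
        return torch.cat([self.prefix, doc_embeds, self.suffix], dim=0)

# Example usage (sizes are illustrative, e.g. Llama-3.1-8B hidden size 4096):
adapter = PacketAdapter(n_soft=8, d_model=4096)
doc_embeds = torch.randn(512, 4096)          # embeddings of one cached document
wrapped = adapter.wrap(doc_embeds)           # shape (8 + 512 + 8, 4096)
```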
Community
KV Packet is a framework for reusing precomputed KV caches across documents without recomputation. Code available at https://github.com/ChuangtaoChen-TUM/KVPacket.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context (2026)
- An experimental study of KV cache reuse strategies in chunk-level caching systems (2026)
- ICaRus: Identical Cache Reuse for Efficient Multi Model Inference (2026)
- QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference (2026)
- RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse (2026)
- ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs (2026)
- AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention (2026)