Representation Shift: Unifying Token Compression with FlashAttention
Abstract
Representation Shift is a training-free, model-agnostic metric that integrates token compression with FlashAttention, enabling significant speedups in video-text retrieval and video QA.
Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computational cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes FlashAttention incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token's representation. This metric seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
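As a rough illustration of how an attention-map-free criterion like this can drive token compression, the sketch below scores each token by how much its hidden state changes across a block and keeps only the highest-scoring tokens. The distance measure (L2 here), the `keep_ratio`, and the function names are illustrative assumptions rather than the paper's exact formulation; the point is that only hidden states are needed, so the attention itself can still run through a fused kernel such as FlashAttention.

```python
import torch


def representation_shift(x_in: torch.Tensor, x_out: torch.Tensor) -> torch.Tensor:
    """Per-token shift between a block's input and output representations.

    x_in, x_out: (batch, num_tokens, dim) hidden states before and after a
    block (e.g., a Transformer block whose attention uses a fused kernel).
    Returns a (batch, num_tokens) score; larger means the token changed more.
    NOTE: L2 distance is an assumption for this sketch; the paper's metric
    may use a different distance function.
    """
    return (x_out - x_in).norm(dim=-1)


def prune_tokens(x_in: torch.Tensor, x_out: torch.Tensor, keep_ratio: float = 0.7):
    """Keep the tokens whose representations shifted the most (illustrative)."""
    scores = representation_shift(x_in, x_out)            # (B, N)
    num_keep = max(1, int(scores.shape[1] * keep_ratio))
    keep_idx = scores.topk(num_keep, dim=1).indices        # (B, num_keep)
    keep_idx, _ = keep_idx.sort(dim=1)                      # preserve token order
    batch_idx = torch.arange(x_out.shape[0]).unsqueeze(1)  # (B, 1) for gathering
    return x_out[batch_idx, keep_idx], keep_idx
```

Because no attention map is ever materialized, this kind of scoring composes with `torch.nn.functional.scaled_dot_product_attention` or FlashAttention backends without modification; the hypothetical `keep_ratio` here would in practice be set per layer or per budget.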
Community
Similar papers recommended by the Semantic Scholar API:
- TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model (2025)
- Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers (2025)
- DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba (2025)
- Training-free Token Reduction for Vision Mamba (2025)
- AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding (2025)
- HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models (2025)
- High-Layer Attention Pruning with Rescaling (2025)