Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Abstract
A spatio-temporal token merging method improves video LLM efficiency by exploiting redundancy, achieving significant speed-ups with minimal accuracy loss.
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data, which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2× speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3× speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at https://www.jshyun.me/projects/sttm.
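To make the two-stage procedure concrete, here is a minimal NumPy sketch of the idea, not the paper's implementation: it assumes frame features arrive as a square H×W grid of token embeddings, uses a cosine-similarity threshold to decide whether a quadtree cell is homogeneous enough to be kept coarse, and, for the directed temporal step, drops a token when a sufficiently similar token of the same granularity already survives from the previous frame. The thresholds, homogeneity test, and matching rule are illustrative assumptions.

```python
# Hedged sketch of STTM-style multi-granular token merging (assumptions:
# square power-of-two token grids, cosine-similarity thresholds tau_s / tau_t,
# and temporal matching only against the previous frame's surviving tokens).
import numpy as np

def cosine(a, b):
    """Cosine similarity between a batch of vectors `a` and one vector `b`."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b) + 1e-8)
    return a @ b

def quadtree_merge(grid, tau_s=0.9, min_size=1):
    """Coarse-to-fine spatial merging of one frame.
    Returns a list of (center_yx, size, vector) multi-granular tokens."""
    H, W, D = grid.shape
    tokens = []

    def visit(y, x, size):
        cell = grid[y:y + size, x:x + size].reshape(-1, D)
        mean = cell.mean(axis=0)
        # Coarse first: keep the whole cell as one token if it is homogeneous.
        if size == min_size or cosine(cell, mean).min() >= tau_s:
            tokens.append(((y + size / 2, x + size / 2), size, mean))
        else:
            # Otherwise refine into four quadrants (finer granularity).
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    visit(y + dy, x + dx, half)

    visit(0, 0, H)  # assumes a square grid with power-of-two side length
    return tokens

def temporal_merge(frames_tokens, tau_t=0.9):
    """Directed pairwise merging across time: a token in frame t is dropped if a
    sufficiently similar token of the same granularity survives from frame t-1."""
    kept = [frames_tokens[0]]
    for cur in frames_tokens[1:]:
        prev = kept[-1]
        survivors = []
        for center, size, vec in cur:
            match = [p for p in prev if p[1] == size]
            if match and max(cosine(np.stack([m[2] for m in match]), vec)) >= tau_t:
                continue  # redundant with an earlier token; reuse that one instead
            survivors.append((center, size, vec))
        kept.append(survivors)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16, 16, 64)).astype(np.float32)  # (T, H, W, D)
    per_frame = [quadtree_merge(f) for f in video]
    reduced = temporal_merge(per_frame)
    print("tokens per frame after merging:", [len(t) for t in reduced])
```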
Community
Double the Speed, Zero Training: The Free Lunch for Video LLMs!
Long videos slow things down: the LLM must prefill a massive context before it can respond. We introduce STTM, the first training-free spatio-temporal token merging method for Video LLMs. Even better, it's query-agnostic, so the reduced KV cache can be reused across multiple questions about the same video.
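Because the merging depends only on the video, the expensive prefill over the reduced tokens can happen once per video and be shared by every question. The toy sketch below illustrates only that reuse pattern; `VideoLLM`, `prefill`, and `generate` are hypothetical stand-ins, not a real Video LLM API.

```python
# Hedged sketch of query-agnostic KV-cache reuse: merged video tokens depend on
# the video alone, so the prefilled cache is built once and reused per question.
from dataclasses import dataclass, field

@dataclass
class VideoLLM:  # hypothetical stand-in for a real Video LLM wrapper
    kv_cache: dict = field(default_factory=dict)  # video_id -> prefilled cache

    def prefill(self, video_id, merged_tokens):
        # Expensive step: run the model over the (already reduced) video tokens once.
        self.kv_cache[video_id] = {"tokens": merged_tokens}  # placeholder for real KV tensors
        return self.kv_cache[video_id]

    def generate(self, video_id, question):
        cache = self.kv_cache[video_id]  # reuse: no re-prefill per question
        return f"answer to {question!r} using {len(cache['tokens'])} cached video tokens"

model = VideoLLM()
merged = ["tok"] * 120                      # stand-in for STTM-reduced tokens of one video
model.prefill("video-001", merged)          # done once per video
for q in ("What happens first?", "Who appears at the end?"):
    print(model.generate("video-001", q))   # many questions, one cached prefill
```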
TL;DR
Merging mechanism. (1) Coarse-to-fine spatial token merging within each frame. (2) Directed temporal merging of multi-granular spatial tokens across nearby frames.
Model generalization. Validated with LLaVA-Video-7B/72B, LLaVA-OneVision-7B, and Qwen2VL-7B.
Dataset coverage. Evaluated on 6 video QA benchmarks covering 3 categories:
- NIAH: VNBench
- Long: VideoMME, LongVideoBench, MLVU
- Short: EgoSchema, NExT-QA
Results
LLaVA-Video-7B. (1) Under a 50% token budget, 2.1× speed-up while retaining 99.5% of the full-token accuracy. (2) Under a 30% budget, 3.0× speed-up at 97.8%.
LLaVA-OneVision-7B. (1) Under a 50% budget, 2.2× speed-up at 102.1%. (2) Under a 30% budget, 3.1× speed-up at 101.1%.
Qwen2VL-7B. (1) Under a 50% budget, 2.6× speed-up at 102.7%. (2) Under a 30% budget, 4.5× speed-up at 100.5%.
LLaVA-Video-72B. (1) Under a 50% budget, 2.3× speed-up at 101.3%. (2) Under a 30% budget, 3.3× speed-up at 99.1%.