QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Abstract
QuickVideo accelerates long-video understanding by combining a parallelized video decoder, memory-efficient prefilling, and overlapping video decoding with inference, enabling real-time performance.
Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, the process of converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components infernece time reduce by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding (2025)
- TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos (2025)
- LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval (2025)
- Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model (2025)
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding (2025)
- Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models (2025)
- Clapper: Compact Learning and Video Representation in VLMs (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper