ViLAMP-llava-qwen

ViLAMP is a video-language model for hour-long video understanding. It addresses the computational bottleneck of long-form video processing through differential distillation, using two mechanisms: (1) query-aware keyframe selection and (2) patch-level feature merging that preserves salient details in non-keyframes. ViLAMP achieves state-of-the-art performance on long-video benchmarks while processing 10K-frame videos on a single GPU.
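As a rough illustration of these two mechanisms, the sketch below shows one way query-aware keyframe selection and patch-level merging could reduce the token count before the language model sees the video. This is a minimal plain-PyTorch sketch under our own assumptions: the function names and the cosine-similarity/feature-norm saliency heuristics are ours, not the authors' implementation. See the GitHub repository for the actual code.

```python
import torch
import torch.nn.functional as F

def select_keyframes(frame_feats: torch.Tensor, query_feat: torch.Tensor, k: int):
    """Score each frame by cosine similarity to the query embedding and keep
    the top-k as keyframes (illustrative scoring, not ViLAMP's exact rule)."""
    scores = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)  # (T,)
    keep = scores.topk(k).indices.sort().values  # restore temporal order
    return keep, scores

def merge_nonkeyframe_patches(patch_feats: torch.Tensor, keep_ratio: float = 0.25):
    """Compress a non-keyframe: keep the most salient patch tokens and average
    the remainder into a single token (illustrative patch-level merging)."""
    saliency = patch_feats.norm(dim=-1)                   # (P,) per-patch magnitude
    k = max(1, int(keep_ratio * patch_feats.size(0)))
    top = saliency.topk(k).indices
    mask = torch.ones(patch_feats.size(0), dtype=torch.bool)
    mask[top] = False
    merged = patch_feats[mask].mean(dim=0, keepdim=True)  # one token for the rest
    return torch.cat([patch_feats[top], merged], dim=0)   # (k + 1, D)

# Toy usage: attending over every patch of every frame in a 10K-frame video is
# infeasible; selecting keyframes and merging non-keyframe patches shrinks the input.
T, D = 128, 768
frames = torch.randn(T, D)   # per-frame pooled visual features
query = torch.randn(D)       # pooled text-query embedding
key_idx, _ = select_keyframes(frames, query, k=16)
compact = merge_nonkeyframe_patches(torch.randn(64, D))  # (17, 768) for one frame
```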

[📂 GitHub] [📜 Paper]

Model size: 8.18B params · Tensor type: BF16 (Safetensors)