ViLAMP-llava-qwen

ViLAMP is a video-language model for hour-long video understanding. It addresses the computational bottleneck of long-form video processing through differential distillation, using two mechanisms: (1) query-aware keyframe selection and (2) patch-level feature merging that preserves salient details in non-keyframes. ViLAMP achieves state-of-the-art performance on long-video benchmarks while processing 10K-frame videos on a single GPU.
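As a rough illustration of these two mechanisms, the sketch below shows one way query-aware keyframe selection and patch-level merging could reduce the token count before the language model sees the video. This is a minimal plain-PyTorch sketch under our own assumptions: the function names and the cosine-similarity/feature-norm saliency heuristics are ours, not the authors' implementation. See the GitHub repository for the actual code.

```python
import torch
import torch.nn.functional as F

def select_keyframes(frame_feats: torch.Tensor, query_feat: torch.Tensor, k: int):
    """Score each frame by cosine similarity to the query embedding and keep
    the top-k as keyframes (illustrative scoring, not ViLAMP's exact rule)."""
    scores = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)  # (T,)
    keep = scores.topk(k).indices.sort().values  # restore temporal order
    return keep, scores

def merge_nonkeyframe_patches(patch_feats: torch.Tensor, keep_ratio: float = 0.25):
    """Compress a non-keyframe: keep the most salient patch tokens and average
    the remainder into a single token (illustrative patch-level merging)."""
    saliency = patch_feats.norm(dim=-1)                   # (P,) per-patch magnitude
    k = max(1, int(keep_ratio * patch_feats.size(0)))
    top = saliency.topk(k).indices
    mask = torch.ones(patch_feats.size(0), dtype=torch.bool)
    mask[top] = False
    merged = patch_feats[mask].mean(dim=0, keepdim=True)  # one token for the rest
    return torch.cat([patch_feats[top], merged], dim=0)   # (k + 1, D)

# Toy usage: attending over every patch of every frame in a 10K-frame video is
# infeasible; selecting keyframes and merging non-keyframe patches shrinks the input.
T, D = 128, 768
frames = torch.randn(T, D)   # per-frame pooled visual features
query = torch.randn(D)       # pooled text-query embedding
key_idx, _ = select_keyframes(frames, query, k=16)
compact = merge_nonkeyframe_patches(torch.randn(64, D))  # (17, 768) for one frame
```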

[📂 GitHub] [📜 Paper]

Model size: 8.18B params · Tensor type: BF16 (Safetensors)