ViLAMP-llava-qwen
ViLAMP is a video-language model for hour-long video understanding that addresses the computational bottleneck of long-form processing through differential distillation. It employs two mechanisms: (1) query-aware keyframe selection, which retains the frames most relevant to the input query, and (2) patch-level feature merging, which compresses non-keyframes while preserving their salient details. ViLAMP achieves state-of-the-art performance on long-video benchmarks and can process 10K-frame videos on a single GPU, balancing accuracy against computational cost.
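The sketch below illustrates the two mechanisms in spirit, not as the released implementation: a greedy keyframe selector that trades query relevance against redundancy, and a patch merger that keeps the most distinctive patches of a non-keyframe and mean-pools the rest into them. Function names, the relevance-minus-redundancy score, and `keep_ratio` are illustrative assumptions, not ViLAMP's actual API.

```python
import torch
import torch.nn.functional as F

def select_keyframes(frame_feats: torch.Tensor,
                     query_feat: torch.Tensor,
                     num_keyframes: int) -> list[int]:
    """Greedy query-aware keyframe selection (illustrative sketch).

    frame_feats: (T, D) pooled per-frame features; query_feat: (D,).
    Picks frames that score high on query relevance while staying
    dissimilar to keyframes already chosen.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    query_feat = F.normalize(query_feat, dim=-1)
    relevance = frame_feats @ query_feat  # cosine similarity, shape (T,)
    selected: list[int] = []
    for _ in range(num_keyframes):
        if selected:
            # Penalize similarity to the closest already-selected frame.
            sim_to_sel = (frame_feats @ frame_feats[selected].T).max(dim=1).values
            score = relevance - sim_to_sel
        else:
            score = relevance.clone()
        score[selected] = float("-inf")  # never re-pick a frame
        selected.append(int(score.argmax()))
    return sorted(selected)

def merge_nonkeyframe_patches(patch_feats: torch.Tensor,
                              keep_ratio: float = 0.25) -> torch.Tensor:
    """Patch-level feature merging for one non-keyframe (illustrative).

    patch_feats: (P, D). Keeps the most distinctive patches and merges
    every dropped patch into its nearest kept patch by mean pooling,
    so salient details survive at a fraction of the token budget.
    """
    P = patch_feats.size(0)
    k = max(1, int(P * keep_ratio))
    norm = F.normalize(patch_feats, dim=-1)
    # Distinctive patches have low mean similarity to the others.
    mean_sim = (norm @ norm.T).mean(dim=1)
    keep = torch.topk(-mean_sim, k).indices
    kept_set = set(keep.tolist())
    rest = torch.tensor([i for i in range(P) if i not in kept_set], dtype=torch.long)
    merged = patch_feats[keep].clone()
    if rest.numel() > 0:
        # Assign each dropped patch to its most similar kept patch.
        assign = (norm[rest] @ norm[keep].T).argmax(dim=1)
        for j in range(k):
            members = rest[assign == j]
            if members.numel() > 0:
                group = torch.cat([patch_feats[keep[j]].unsqueeze(0),
                                   patch_feats[members]])
                merged[j] = group.mean(dim=0)
    return merged
```

Under this reading, keyframes keep their full patch grid while every non-keyframe is reduced to `keep_ratio` of its patches, which is what makes a 10K-frame input fit on a single GPU.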