Apollo: An Exploration of Video Understanding in Large Multimodal Models
Abstract
Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrate that fps sampling during training is vastly preferable to uniform frame sampling, and we identify which vision encoders are best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with a score of 55.1 on LongVideoBench. Apollo-7B is state-of-the-art among 7B LMMs, with 70.9 on MLVU and 63.3 on Video-MME.
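The distinction between fps sampling and uniform frame sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 30 fps source rate, the 2 fps target rate, and the 8-frame budget are all assumptions chosen for illustration. Uniform sampling always returns the same number of frames, so the time gap between frames grows with video length; fps sampling keeps the time gap fixed, so the frame count grows instead.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over the whole video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the middle frame of each of the num_samples equal-length segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at a fixed rate of target_fps frames per second."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample faster than the source rate (stride of at least one frame).
    stride = max(1.0, native_fps / target_fps)
    indices = []
    i = 0
    while int(i * stride) < total_frames:
        indices.append(int(i * stride))
        i += 1
    return indices


if __name__ == "__main__":
    native_fps = 30.0  # assumed source frame rate
    for seconds in (10, 60):
        total = int(seconds * native_fps)
        uniform = uniform_frame_indices(total, num_samples=8)
        by_fps = fps_frame_indices(total, native_fps, target_fps=2.0)
        print(f"{seconds:>3}s clip: uniform={len(uniform)} frames, fps={len(by_fps)} frames")
```

Under these assumed settings, a 10-second clip and a 60-second clip both yield 8 frames with uniform sampling (spaced 1.25 s and 7.5 s apart, respectively), while fps sampling yields 20 and 120 frames at a constant 0.5 s spacing.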
Community
What can be seen in the photo?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding (2024)
- SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context (2024)
- VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation (2024)
- VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation (2024)
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM (2024)
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models (2024)
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability (2024)
Can I know when the code and model weights will be available again?
Unfortunately, the code, model, and Space demo have all been pulled for some reason (the PR of Meta releasing a Qwen-based model?).
The weights have been mirrored to https://huggingface.co/GoodiesHere and the inference code has been mirrored to https://huggingface.co/manysuch-cases/Apollo-Github-Files.