Is Heuristic Sampling Necessary in Training Deep Object Detectors? Paper • 1909.04868 • Published Sep 11, 2019
Bootstrapping SparseFormers from Vision Foundation Models Paper • 2312.01987 • Published Dec 4, 2023 • 1
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Paper • 2306.08640 • Published Jun 14, 2023 • 26
Learning Video Context as Interleaved Multimodal Sequences Paper • 2407.21757 • Published Jul 31, 2024
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation Paper • 2408.16730 • Published Aug 29, 2024
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Paper • 2504.16030 • Published Apr 22, 2025 • 38
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos Paper • 2409.19603 • Published Sep 29, 2024 • 19
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published Jun 17, 2024 • 25
UniVTG: Towards Unified Video-Language Temporal Grounding Paper • 2307.16715 • Published Jul 31, 2023 • 11