VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Paper • 2501.01957 • Published Jan 2025
VITA: Towards Open-Source Interactive Omni Multimodal LLM Paper • 2408.05211 • Published Aug 9, 2024
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models Paper • 2408.02085 • Published Aug 4, 2024
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis Paper • 2405.21075 • Published May 31, 2024
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models Paper • 2306.13394 • Published Jun 23, 2023
MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation Paper • 2308.08239 • Published Aug 16, 2023
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration Paper • 2309.01131 • Published Sep 3, 2023
D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation Paper • 2308.04197 • Published Aug 8, 2023
Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval Paper • 2308.04008 • Published Aug 8, 2023
Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion Paper • 2009.05757 • Published Sep 12, 2020
Woodpecker: Hallucination Correction for Multimodal Large Language Models Paper • 2310.16045 • Published Oct 24, 2023
A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise Paper • 2312.12436 • Published Dec 19, 2023