ZeroSearch: Incentivize the Search Capability of LLMs without Searching Paper • 2505.04588 • Published 21 days ago • 63
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models Paper • 2504.10479 • Published Apr 14 • 265
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency Paper • 2502.09621 • Published Feb 13 • 28
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published Jan 22 • 91
Qwen2-VL Collection Vision-language model series based on Qwen2 • 16 items • Updated 30 days ago • 216
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation Paper • 2411.08380 • Published Nov 13, 2024 • 27
Chain of Ideas: Revolutionizing Research in Novel Idea Development with LLM Agents Paper • 2410.13185 • Published Oct 17, 2024 • 6
LLaVA-Video Collection Models focused on video understanding (previously known as LLaVA-NeXT-Video). • 8 items • Updated Feb 21 • 61
LLaVA-Critic Collection A general evaluator for assessing model performance • 6 items • Updated Oct 6, 2024 • 10
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024 • 95
LLaVA-Critic: Learning to Evaluate Multimodal Models Paper • 2410.02712 • Published Oct 3, 2024 • 37
ColPali: Efficient Document Retrieval with Vision Language Models 👀 Article • By manu • Jul 5, 2024 • 253
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages Paper • 2407.19672 • Published Jul 29, 2024 • 58