BRAVE: Broadening the visual encoding of vision-language models Paper • 2404.07204 • Published Apr 10 • 18
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models Paper • 2403.18814 • Published Mar 27 • 45
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models Paper • 2409.17146 • Published Sep 25 • 104
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models Paper • 2407.07895 • Published Jul 10 • 40
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model Paper • 2409.01704 • Published Sep 3 • 83
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18 • 74
Unifying Multimodal Retrieval via Document Screenshot Embedding Paper • 2406.11251 • Published Jun 17 • 9
ColPali: Efficient Document Retrieval with Vision Language Models Paper • 2407.01449 • Published Jun 27 • 42
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22 • 124