
Vision Language Models: 2025 Update
This collection includes all the models, datasets, and Spaces mentioned in the blog post "Vision Language Models: 2025 Update".
- Any-to-Any • Updated • 203k • 1.62k
- 306
Qwen2.5 Omni 7B Demo
🏆Generate text and speech responses from text, images, or audio input
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 152
openbmb/MiniCPM-o-2_6
Any-to-Any • Updated • 431k • 1.14k
deepseek-ai/Janus-Pro-7B
Any-to-Any • Updated • 92.4k • 3.39k
- 1.97k
Chat With Janus-Pro-7B
🌍A unified multimodal understanding and generation model.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Paper • 2501.17811 • Published • 6
Qwen/QVQ-72B-Preview
Image-Text-to-Text • Updated • 45.9k • 589
moonshotai/Kimi-VL-A3B-Thinking
Image-Text-to-Text • Updated • 48.9k • 403
- 100
Chat with Kimi-VL-A3B-Thinking
🤔Chat with Kimi-VL-A3B-Thinking using text and images
Kimi-VL Technical Report
Paper • 2504.07491 • Published • 124
moonshotai/MoonViT-SO-400M
Image Feature Extraction • Updated • 486 • 15
google/siglip-so400m-patch14-384
Zero-Shot Image Classification • Updated • 7.02M • 544
moonshotai/Kimi-VL-A3B-Instruct
Image-Text-to-Text • Updated • 110k • 193
HuggingFaceTB/SmolVLM-Instruct
Image-Text-to-Text • Updated • 59.3k • 468
- 136
SmolVLM
📊Generate text responses using images and text prompts
HuggingFaceTB/SmolVLM2-2.2B-Instruct
Image-Text-to-Text • Updated • 84.1k • 191
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 184
- 76
SmolVLM
📊Generate descriptions and answers from images and videos
google/gemma-3-27b-it
Image-Text-to-Text • Updated • 396k • 1.38k
unsloth/gemma-3-27b-it-GGUF
Image-Text-to-Text • Updated • 62k • 122
google/gemma-3-27b-it-qat-q4_0-gguf
Image-Text-to-Text • Updated • 11.5k • 287
meta-llama/Llama-4-Scout-17B-16E-Instruct
Image-Text-to-Text • Updated • 320k • 915
meta-llama/Llama-4-Maverick-17B-128E-Instruct
Image-Text-to-Text • Updated • 46.6k • 334
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 53
deepseek-ai/deepseek-vl2
Image-Text-to-Text • Updated • 9.69k • 331
- 473
Chat with DeepSeek-VL2-small
🌍Generate responses using images and text input
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Paper • 2412.10302 • Published • 18
lerobot/pi0
Robotics • Updated • 12.3k • 247
lerobot/pi0fast_base
Robotics • Updated • 1.51k • 16
nvidia/GR00T-N1-2B
Robotics • Updated • 5.38k • 304
google/paligemma-3b-pt-224
Image-Text-to-Text • Updated • 28.2k • 329
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 71
- 313
PaliGemma Demo
🤲Annotate and describe images with text prompts
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper • 2412.03555 • Published • 134
- 90
Paligemma2 Mix
🌖Generate text or segment objects from an image
google/paligemma2-10b-mix-448
Image-Text-to-Text • Updated • 19.5k • 27
allenai/Molmo-72B-0924
Image-Text-to-Text • Updated • 1.76k • 284
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 114
Qwen/Qwen2.5-VL-72B-Instruct
Image-Text-to-Text • Updated • 185k • 464
Qwen2.5-VL Technical Report
Paper • 2502.13923 • Published • 187
google/shieldgemma-2-4b-it
Image-Text-to-Text • Updated • 26.7k • 98
ShieldGemma 2: Robust and Tractable Image Content Moderation
Paper • 2504.01081 • Published • 3
- 11
ShieldGemma2 VLM
📉Demo for ShieldGemma 2, multimodal safety model
meta-llama/Llama-Guard-4-12B
Image-Text-to-Text • Updated • 22.4k • 33
Llama Guard 4
🦀Check if text and images are safe
- 241
Qwen2.5 VL 72B Instruct
💻Chat with an AI that understands text and images
marco/mcdse-2b-v1
Updated • 3.93k • 54
vidore/colpali-v1.3
Visual Document Retrieval • Updated • 120k • 46
ColPali: Efficient Document Retrieval with Vision Language Models
Paper • 2407.01449 • Published • 48
vidore/colqwen2.5-v0.2
Visual Document Retrieval • Updated • 11.9k • 35
vidore/colsmolvlm-v0.1
Visual Document Retrieval • Updated • 688 • 52
Qwen/Qwen2.5-VL-32B-Instruct
Image-Text-to-Text • Updated • 385k • 371
- 117
Qwen2.5 VL 32B Instruct Demo
🏃Chat with images and videos using Qwen
Vision-CAIR/LongVU_Qwen2_7B
Video-Text-to-Text • Updated • 1.3k • 71
- 79
LongVU
🌖Generate responses to video or image inputs
openbmb/RLAIF-V-Dataset
Viewer • Updated • 74.8k • 1.51k • 173
HuggingFaceH4/rlaif-v_formatted
Viewer • Updated • 83.1k • 365 • 10
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Paper • 2404.16006 • Published
Kaining/MMT-Bench
Viewer • Updated • 30k • 74 • 10
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Paper • 2409.02813 • Published • 31
MMMU/MMMU_Pro
Viewer • Updated • 5.19k • 6.04k • 26
reducto/RolmOCR
Image-Text-to-Text • Updated • 139k • 409
Alpha-VLLM/Lumina-mGPT-7B-768
Any-to-Any • Updated • 9.8k • 35
facebook/chameleon-7b
Image-Text-to-Text • Updated • 33.4k • 180