Vision Language Models: 2025 Update

sergiopaniego's Collections

updated May 12

This collection includes all the models, datasets, and Spaces mentioned in the blog post "Vision Language Models: 2025 Update".


Qwen/Qwen2.5-Omni-7B

Any-to-Any • Updated Apr 30 • 408k • 1.66k
Running

320


Qwen2.5 Omni 7B Demo

🏆

Generate text and speech responses from text, images, or audio input
Qwen2.5-Omni Technical Report

Paper • 2503.20215 • Published Mar 26 • 158
openbmb/MiniCPM-o-2_6

Any-to-Any • Updated May 13 • 238k • 1.16k
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1 • 89.6k • 3.41k
Running on Zero

1.98k


Chat With Janus-Pro-7B

🌍

A unified multimodal understanding and generation model.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Paper • 2501.17811 • Published Jan 29 • 6
Qwen/QVQ-72B-Preview

Image-Text-to-Text • Updated Jan 12 • 45.1k • 593
moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • Updated Apr 20 • 61.1k • 411
Running on Zero

103


Chat with Kimi-VL-A3B-Thinking

🤔

Chat with Kimi-VL-A3B-Thinking using text and images
Kimi-VL Technical Report

Paper • 2504.07491 • Published Apr 10 • 129
moonshotai/MoonViT-SO-400M

Image Feature Extraction • Updated Apr 17 • 213 • 16
google/siglip-so400m-patch14-384

Zero-Shot Image Classification • Updated Sep 26, 2024 • 4.02M • 555
moonshotai/Kimi-VL-A3B-Instruct

Image-Text-to-Text • Updated Apr 20 • 222k • 196
HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • Updated Apr 8 • 206k • 490
Running on Zero

138


SmolVLM

📊

Generate text responses using images and text prompts
HuggingFaceTB/SmolVLM2-2.2B-Instruct

Image-Text-to-Text • Updated Apr 8 • 59.5k • 206
SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 189
Running on Zero

77


SmolVLM

📊

Answer questions using images or videos
google/gemma-3-27b-it

Image-Text-to-Text • Updated Mar 21 • 390k • 1.43k
unsloth/gemma-3-27b-it-GGUF

Image-Text-to-Text • Updated May 12 • 48.1k • 134
google/gemma-3-27b-it-qat-q4_0-gguf

Image-Text-to-Text • Updated Apr 11 • 11.6k • 298
meta-llama/Llama-4-Scout-17B-16E-Instruct

Image-Text-to-Text • Updated 24 days ago • 548k • 944
meta-llama/Llama-4-Maverick-17B-128E-Instruct

Image-Text-to-Text • Updated 24 days ago • 53.7k • 347
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Paper • 2401.15947 • Published Jan 29, 2024 • 54
deepseek-ai/deepseek-vl2

Image-Text-to-Text • Updated Dec 18, 2024 • 7.14k • 336
Running on Zero

491


Chat with DeepSeek-VL2-small

🌍

Generate responses using images and text input
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Paper • 2412.10302 • Published Dec 13, 2024 • 18
lerobot/pi0

Robotics • Updated Mar 6 • 12.2k • 264
lerobot/pi0fast_base

Robotics • Updated Mar 31 • 1.84k • 23
nvidia/GR00T-N1-2B

Robotics • Updated Mar 18 • 7.24k • 318
google/paligemma-3b-pt-224

Image-Text-to-Text • Updated Sep 21, 2024 • 49.4k • 329
PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10, 2024 • 71
Paused

313


PaliGemma Demo

🤲

Annotate and describe images with text prompts
PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 134
Running on Zero

91


Paligemma2 Mix

🌖

Generate text or segment objects from an image
google/paligemma2-10b-mix-448

Image-Text-to-Text • Updated Feb 7 • 15.5k • 30
allenai/Molmo-72B-0924

Image-Text-to-Text • Updated Oct 10, 2024 • 2.28k • 284
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 117
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • Updated 10 days ago • 870k • 483
Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published Feb 19 • 193
google/shieldgemma-2-4b-it

Image-Text-to-Text • Updated Apr 4 • 11.4k • 109
ShieldGemma 2: Robust and Tractable Image Content Moderation

Paper • 2504.01081 • Published Apr 1 • 3
Running on Zero

11


ShieldGemma2 VLM

📉

Demo for ShieldGemma 2, multimodal safety model
meta-llama/Llama-Guard-4-12B

Image-Text-to-Text • Updated Apr 29 • 57.3k • 44
Running on Zero

Llama Guard 4

🦀

Check if text and images are safe
Running

246


Qwen2.5 VL 72B Instruct

💻

Chat with an AI that understands text and images
marco/mcdse-2b-v1

Updated Oct 29, 2024 • 5.2k • 56
vidore/colpali-v1.3

Visual Document Retrieval • Updated Mar 14 • 242k • 53
ColPali: Efficient Document Retrieval with Vision Language Models

Paper • 2407.01449 • Published Jun 27, 2024 • 49
vidore/colqwen2.5-v0.2

Visual Document Retrieval • Updated 10 days ago • 38.2k • 50
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14 • 5.88k • 52
Qwen/Qwen2.5-VL-32B-Instruct

Image-Text-to-Text • Updated Apr 14 • 481k • 384
Running

130


Qwen2.5 VL 32B Instruct Demo

🏃

Chat with images and videos using Qwen
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • Updated Feb 28 • 1.09k • 72
Running on Zero

79


LongVU

🌖

Generate responses to video or image inputs
openbmb/RLAIF-V-Dataset

Viewer • Updated Mar 4 • 74.8k • 1.25k • 175
HuggingFaceH4/rlaif-v_formatted

Viewer • Updated Jul 2, 2024 • 83.1k • 729 • 10
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Paper • 2404.16006 • Published Apr 24, 2024
Kaining/MMT-Bench

Viewer • Updated Jun 21, 2024 • 30k • 37 • 10
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Paper • 2409.02813 • Published Sep 4, 2024 • 32
MMMU/MMMU_Pro

Viewer • Updated Mar 8 • 5.19k • 3.99k • 27
reducto/RolmOCR

Image-Text-to-Text • Updated Apr 2 • 123k • 424
Alpha-VLLM/Lumina-mGPT-7B-768

Any-to-Any • Updated Apr 7 • 2.11k • 35
facebook/chameleon-7b

Image-Text-to-Text • Updated Jul 23, 2024 • 47.7k • 183
