
Qwen/Qwen2.5-Omni-7B (Any-to-Any)
This collection includes all the models, datasets, and Spaces mentioned in the blog post "Vision Language Models: 2025 Update".
Generate text and speech responses from text, images, or audio input
A unified multimodal understanding and generation model
Chat with Kimi-VL-A3B-Thinking using text and images
Generate text responses using images and text prompts
Answer questions using images or videos
Generate responses using images and text input
Annotate and describe images with text prompts
Generate text or segment objects from an image
Demo for ShieldGemma 2, a multimodal safety model
Check if text and images are safe
Chat with an AI that understands text and images
Chat with images and videos using Qwen
Generate responses to video or image inputs