vision LMs have saturated benchmarks, so we built vibe eval
> compare different models on refreshed in-the-wild examples across categories
> submit your favorite model for eval -- no numbers, just vibes!
emerging trend: models that can understand image + text and generate image + text
don't miss out:
> MMaDA: a single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO -- Gen-Verse/MMaDA
> BAGEL: a 7B MoT model based on Qwen2.5, SigLIP-so-400M and the Flux VAE -- ByteDance-Seed/BAGEL
both by ByteDance!
multimodal
> new moondream (VLM) is out: a 4-bit quantized (with QAT) version of moondream-2b that runs on 2.5GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS)
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDA is a new 8B diffusion language model that can generate both image and text
LLMs
> Mistral released Devstral, a 24B coding assistant (OS)
> Fairy R1-32B is a new reasoning model -- a distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released AceReason-Nemotron-14B, a new 14B math and code reasoning model
> sarvam-m is a new Indic LM with a hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)
image generation
> MTVCrafter is a new human motion animation generator
> the first reasoning model for robotics
> based on Qwen2.5-VL-7B; use it with Hugging Face transformers or vLLM (see the sketch below)
> comes with SFT & alignment datasets and a new benchmark
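Since it's Qwen2.5-VL-based, the standard transformers vision-chat recipe should apply. A minimal sketch, assuming a recent transformers version; the model id is a placeholder because the post doesn't name the repo, and the image URL/prompt are made up for illustration:

```python
# Minimal sketch: querying a Qwen2.5-VL-based robotics reasoning model with transformers.
# The model id is a placeholder -- substitute the actual Hub repo from the release.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "<org>/<robotics-reasoning-vlm>"  # hypothetical placeholder
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/robot_scene.jpg"},
        {"type": "text", "text": "What should the robot arm do next to pick up the mug? Reason step by step."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

The same checkpoint can also be served with vLLM (`vllm serve <model-id>`) for higher-throughput inference.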
LLMs
> Alibaba Qwen released WorldPM-72B, a new World Preference Model trained on 15M preference samples (OS)
> II-Medical-8B is a new 8B LLM for medical reasoning by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)
Multimodal
> Salesforce Research released BLIP3-o, a new any-to-any model with image-text input and image-text output. it's based on an image encoder, a text decoder and a DiT, and comes in 8B
> they also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio and text (interleaved)
Image Generation
> Alibaba Wan-AI released Wan2.1-VACE, a video foundation model for image- and text-to-video, video-to-audio and more tasks; comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, a new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (AirBnB-style assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation model (text- and image-to-video) by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators
Audio
> stabilityai released stable-audio-open-small, a new text-to-audio model
> TEN-framework released ten-vad, a voice activity detection model (OS)
We just shipped a blog post covering everything new in vision language models, including GUI agents, agentic VLMs, omni models, multimodal RAG, video LMs, smol models ...and more! https://huggingface.co/blog/vlms-2025
Qwen made it rain! they released Qwen3, new dense and MoE models ranging from 0.6B to 235B, as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B!
> Microsoft AI released Phi-4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
> ByteDance released UI-TARS-1.5, a native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
> NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation and conversation model
> JetBrains released Mellum models in base and SFT variants for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model
you can now easily fine-tune, quantize and play with the SOTA vision LM InternVL3: we recently merged InternVL3 into Hugging Face transformers and released converted checkpoints
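A minimal sketch of trying it through the transformers image-text-to-text pipeline; the checkpoint id below is my assumption about the converted "-hf" repo naming, so double-check the exact name and size on the Hub:

```python
# Minimal sketch: chatting with a converted InternVL3 checkpoint via transformers.
# The model id is an assumption about the "-hf" converted repos -- verify on the Hub.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="OpenGVLab/InternVL3-1B-hf",
    torch_dtype="auto",
    device_map="auto",
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])
```

From there the usual transformers toolbox applies: bitsandbytes/torchao for quantization, TRL/PEFT for fine-tuning.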
Meta released Llama Guard 4 and new Prompt Guard 2 models
Llama Guard 4 is a new model to filter model inputs/outputs, both text-only and image. use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B
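A minimal sketch of the "check before the LLM" step, assuming a recent transformers version; Llama Guard models answer with "safe" or "unsafe" plus the violated category, and I use the Auto classes as a hedge since the exact model class may differ:

```python
# Minimal sketch: moderating a user prompt with Llama Guard 4 before it reaches your LLM.
# Auto classes used as a hedge -- the exact class may differ by transformers version.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "meta-llama/Llama-Guard-4-12B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def moderate(conversation):
    # The chat template wraps the conversation in Llama Guard's safety-classification prompt.
    inputs = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    # Expected output: "safe", or "unsafe" followed by a category code such as "S9".
    return processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Check the user turn before calling your LLM/VLM...
print(moderate([{"role": "user", "content": [{"type": "text", "text": "How do I pick a lock?"}]}]))
# ...and check the assistant turn afterwards by appending the model's reply to the conversation.
```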
Meta dropped swiss army knives for vision with an Apache 2.0 license
> image/video encoders for vision language modelling and spatial understanding (object detection etc.)
> the vision LM outperforms InternVL3 and Qwen2.5VL
> they also release gigantic video and image datasets
The authors set out to build a single versatile vision encoder that can be aligned to a diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms the latest SOTA, SigLIP2.
> Among the fine-tuned variants, the first is PE-Spatial: a model for bounding box detection, segmentation and depth estimation that outperforms all other models
> The second is PLM, the Perception Language Model, which combines PE-Core with the Qwen2.5 7B LM. it outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
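For context on what "aligned for zero-shot image tasks" looks like in practice, here is the dual-encoder zero-shot classification recipe in transformers. It uses SigLIP2 (the baseline PE Core is compared against) as a stand-in, since I'm not assuming PE Core itself loads through transformers; the exact SigLIP2 checkpoint id is also an assumption, so check the Hub:

```python
# Minimal sketch: zero-shot image classification with an image-text dual encoder.
# SigLIP2 is used as a stand-in for PE Core; swap the model id if/when PE Core
# checkpoints are usable through transformers. Checkpoint id is an assumption.
from transformers import pipeline

clf = pipeline("zero-shot-image-classification", model="google/siglip2-base-patch16-224")

result = clf(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    candidate_labels=["a bee on a flower", "a dog on a couch", "a city skyline"],
)
print(result)  # list of {"label": ..., "score": ...}, best match first
```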
The authors release PE checkpoints in base, large and giant sizes.
They also release the following datasets:
> PE Video: a gigantic video dataset of 1M videos with 120k expert annotations
> PLM-Video and PLM-Image: human- and auto-annotated image and video datasets for region-based tasks
> PLM-VideoBench: a new video benchmark on MCQA
Most vision LMs focus on the image as a whole: their captions lack localized references, and they don't take visual prompts (points, boxes, drawings around objects).
DAM addresses this on two levels: a new vision backbone that takes in both focal crops and the full image, and a large-scale dataset
They build the dataset by extending existing segmentation and referring-expression datasets like RefCOCO, passing the images and class labels to VLMs to generate detailed captions.
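A minimal sketch of that recipe under assumed details (the box format, prompt wording and choice of captioner VLM are illustrative, not necessarily what DAM uses): crop an annotated region, then ask a VLM to describe it given its class label.

```python
# Minimal sketch: turning a segmentation / referring-expression annotation into a
# detailed region caption with an off-the-shelf VLM. Box format, prompt and the
# captioner model are illustrative assumptions, not the exact DAM pipeline.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto")

def caption_region(image_path, box, class_name):
    image = Image.open(image_path).convert("RGB")
    crop = image.crop(box)  # box = (x_min, y_min, x_max, y_max), RefCOCO-style
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": crop},
            {"type": "text", "text": f"This crop shows a {class_name}. Describe it in detail."},
        ],
    }]
    out = captioner(text=messages, max_new_tokens=128)
    return out[0]["generated_text"][-1]["content"]

# e.g. caption_region("coco_train_000000.jpg", (48, 120, 310, 415), "dog")
```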
Lastly, they also release a new benchmark that again avoids human annotations: an LLM evaluates the detailed captions with a focus on localization.