the method is simple: find which tokens have the highest attention scores, merge the rest of the tokens based on similarity, then merge both sets
their method works both training-free and with fine-tuning: the authors report a 5-point average improvement on vision language tasks + 8x faster prefilling for LLaVA-NeXT 7B and 13B 🤯
removing redundant tokens improves image token quality too 🥹
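here's a rough sketch of that recipe in PyTorch -- not the authors' code; the keep ratio, anchor choice, and cluster count are made up for illustration:

```python
import torch
import torch.nn.functional as F

def reduce_image_tokens(image_tokens, attn_scores, keep_ratio=0.25, n_clusters=32):
    # image_tokens: (N, D) visual features, attn_scores: (N,) attention received per token
    n_tokens = image_tokens.shape[0]
    n_keep = max(1, int(n_tokens * keep_ratio))

    # 1. keep the tokens with the highest attention scores
    keep_idx = attn_scores.topk(n_keep).indices
    kept = image_tokens[keep_idx]

    # 2. merge the rest based on similarity: assign each leftover token
    #    to its most similar anchor and average within each group
    mask = torch.ones(n_tokens, dtype=torch.bool, device=image_tokens.device)
    mask[keep_idx] = False
    rest = image_tokens[mask]
    if rest.shape[0] == 0:
        return kept
    n_clusters = min(n_clusters, rest.shape[0])

    rest_norm = F.normalize(rest, dim=-1)
    anchors = rest_norm[:n_clusters]                  # crude anchor choice, for illustration only
    assign = (rest_norm @ anchors.T).argmax(dim=-1)   # (N - n_keep,)
    merged = torch.stack([
        rest[assign == c].mean(dim=0) if (assign == c).any() else rest[c]
        for c in range(n_clusters)
    ])

    # 3. merge both sets into one shorter visual sequence
    return torch.cat([kept, merged], dim=0)
```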
we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face 🔥 use them right away! it's where the community shares optimized kernels 🤗
this release comes in three parts
> Kernel Hub: contains (as of now) 14 kernels
> kernels: Python library to load kernels from the Kernel Hub
> kernel-builder: Nix package to build kernels for PyTorch (made using the PyTorch C++ frontend)
when building models, your regular workflow should be pulling kernels from the Hub and building your model with them 🤗 here's a practical example with RMSNorm:
1. pull the kernel from the Hub with get_kernel
2. decorate your layer with use_kernel_forward_from_hub
3. inject it into your model
we'd love to hear your feedback! we also welcome kernel contributions from the community 🥹
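a minimal sketch of those steps with the kernels library -- the activation repo comes from the library's README, and the layer name and RMSNorm body below are illustrative, not from the post:

```python
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull an optimized kernel straight from the Hub
#    (example repo from the kernels README)
activation = get_kernel("kernels-community/activation")
x = torch.randn((16, 64), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # kernels are exposed as plain callables

# 2. decorate your layer so its forward can be swapped for a Hub kernel
@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states):
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        return self.weight * hidden_states * torch.rsqrt(variance + self.eps)

# 3. build your model with this layer as usual; when a matching kernel is
#    registered for your device, its optimized forward replaces the one above
```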
Dolphin: new OCR model by ByteDance with an MIT license 🐬
the model first detects the elements in the layout (tables, formulas etc.) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
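in pseudocode the analyze-then-parse flow looks roughly like this -- detect_layout and parse_element are hypothetical stand-ins for the model's two stages, not Dolphin's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_document(page_image, detect_layout, parse_element):
    # stage 1: detect layout elements (tables, formulas, text blocks, ...)
    # in reading order
    elements = detect_layout(page_image)

    # stage 2: parse each detected element independently, so the per-element
    # generations can run in parallel (here naively via a thread pool)
    with ThreadPoolExecutor() as pool:
        parsed = list(pool.map(lambda el: parse_element(page_image, el), elements))

    return parsed
```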
stop building parser pipelines: there's a new document parser that is small, fast, Apache 2.0 licensed and better than all the other ones! 😱
echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables etc.) in a document
> the authors show in the paper that document parsing pipelines often suffer from error propagation
> single end-to-end models do better, but they're too heavy to use
this model addresses both: it's lighter, faster, stronger 🔥
> based on ViT, with different sizes (L/G/H) and resolutions (286/384)
> 0-day support in 🤗 transformers
> comes with physical reasoning (from video) benchmarks: MVPBench, IntPhys 2, and CausalVQA
leaderboard: facebook/physical_reasoning_leaderboard
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it 🥹
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SotA on average across benchmarks, based on Qwen/Qwen2.5-Omni-3B 🗣️
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below), and it's super competitive ⏯️ based on Qwen/Qwen2.5-Omni-7B
vision LMs are saturating benchmarks, so we built vibe eval 💬
> compare different models with refreshed in-the-wild examples in different categories
> submit your favorite model for eval
no numbers -- just vibes!
emerging trend: models that can understand image + text and generate image + text
don't miss out ⤵️
> MMaDA: a single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: a 7B MoT model based on Qwen2.5, SigLIP-so-400M, and the Flux VAE ByteDance-Seed/BAGEL
both by ByteDance! 😱
multimodal 💬🖼️
> new moondream (VLM) is out: it's a 4-bit quantized (with QAT) version of moondream-2b, runs on 2.5GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS)
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM 🐬 (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDA is a new 8B diffusion language model that can generate both image and text
LLMs
> Mistral released Devstral, a 24B coding assistant (OS) 👩🏻‍💻
> Fairy R1-32B is a new reasoning model -- a distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released AceReason-Nemotron-14B, a new 14B math and code reasoning model
> sarvam-m is a new Indic LM with hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)
image generation 🎨
> MTVCrafter is a new human motion animation generator