🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM with 3B active params, smarter with fewer tokens, supports long documents and videos (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5-VL-3B-Instruct (OS)
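if you want to kick the tires on Nanonets-OCR-s, here's a minimal sketch with 🤗 transformers; since it's Qwen2.5-VL-based it should load through the image-text-to-text classes, but the prompt wording and generation settings below are my assumptions, so check the model card for the recommended ones

```python
# minimal sketch, not the official snippet: prompt and generation settings are assumptions
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "nanonets/Nanonets-OCR-s"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("page.png")  # any document page
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text of this document as markdown."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```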
🗣️ Audio
> Google released google/magenta-realtime, real-time music generation & audio synthesis (CC-BY-4.0)
> kyutai released new speech-to-text models in 1B & 2B sizes (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay
y'all have been asking my opinion on how OCR models compare to each other, so I'll leave you three apps by @prithivMLmods to compare the newest models instead ⤵️
> Nanonets-OCR-s, Qwen2-VL-OCR-2B-Instruct, RolmOCR, Aya-Vision: prithivMLmods/Multimodal-OCR
> SmolDocling, Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B: prithivMLmods/Multimodal-OCR2
> docscopeOCR, MonkeyOCR, coreOCR: prithivMLmods/core-OCR
so far I've figured out:
> for fact-checks, you need a relatively bigger model (7B is ok!)
> Gemma 3 degrades without pan-and-scan
> Qwen2.5-VL-32B is very talkative, great for reasoning but not good for simple tasks 🗣️
the method is simple: find which tokens have the highest attention scores, merge the rest of the tokens based on similarity, then merge both sets
their method works both training-free and with fine-tuning; the authors report a 5-point improvement on average across vision-language tasks plus an 8x improvement in prefilling time for LLaVA-NeXT 7B and 13B 🤯
removing redundant tokens improves image token quality too 🥹
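here's a toy sketch of the idea (my own illustration, not the paper's code): keep the highest-attention visual tokens, group the rest by cosine similarity and average each group, then concatenate both sets

```python
# toy illustration of the described method, not the paper's implementation
import torch
import torch.nn.functional as F

def reduce_visual_tokens(tokens, attn_scores, n_keep=64, n_merged=16):
    """tokens: (N, D) visual tokens; attn_scores: (N,) attention received by each token."""
    # 1. keep the tokens with the highest attention scores
    keep_idx = attn_scores.topk(n_keep).indices
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[keep_idx] = False
    kept, rest = tokens[keep_idx], tokens[mask]

    # 2. merge the remaining tokens based on similarity: assign each one to its most
    #    similar "anchor" token and average per group (a simple stand-in for the paper's scheme)
    anchors = rest[:n_merged]
    sim = F.normalize(rest, dim=-1) @ F.normalize(anchors, dim=-1).T   # (R, n_merged)
    assign = sim.argmax(dim=-1)
    merged = torch.stack([
        rest[assign == g].mean(dim=0) if (assign == g).any() else anchors[g]
        for g in range(n_merged)
    ])

    # 3. merge both sets into the reduced token sequence
    return torch.cat([kept, merged], dim=0)

reduced = reduce_visual_tokens(torch.randn(576, 1024), torch.rand(576))
print(reduced.shape)  # torch.Size([80, 1024])
```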
we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face 🔥 use them right away! it's where the community populates optimized kernels 🤗
this release comes in three parts:
> Kernel Hub: contains (as of now) 14 kernels
> kernels: a Python library to load kernels from the Kernel Hub
> kernel-builder: a Nix package to build kernels for PyTorch (made using the PyTorch C++ frontend)
when building models, your regular workflow should be pulling kernels from the Hub and building your model with them 🤗 here's a practical example with RMSNorm (sketched below):
1. pull the kernel from the Hub with get_kernel
2. decorate your layer with use_kernel_forward_from_hub
3. inject it into your model
we'd love to hear your feedback! we also welcome kernel contributions from the community 🥹
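a hedged sketch of those three steps: the repo id and kernel name below are assumptions, and depending on your version of the kernels library you may need an extra mapping or kernelize step to actually swap the forward at runtime

```python
# sketch of the workflow above; repo id / kernel name are assumptions, check the Kernel Hub
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull a kernel from the Hub and call it directly
activation = get_kernel("kernels-community/activation")   # assumed repo id
x = torch.randn(8, 128, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)

# 2. decorate your layer so its forward can be replaced by a Hub kernel
@use_kernel_forward_from_hub("RMSNorm")   # assumed kernel/layer name
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):  # reference path, used when no kernel is injected
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

# 3. inject the layer into your model and use it as usual
norm = RMSNorm(4096).to("cuda", torch.float16)
out = norm(torch.randn(2, 16, 4096, device="cuda", dtype=torch.float16))
```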
Dolphin: new OCR model by ByteDance with MIT license 🐬
the model first detects elements in the layout (tables, formulas etc.) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
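the flow roughly looks like this (a toy sketch with placeholder functions, not ByteDance's actual inference code):

```python
# toy sketch of the two-stage flow described above; both helpers are placeholders,
# not Dolphin's real layout-detection / element-parsing calls
from concurrent.futures import ThreadPoolExecutor

def detect_layout_elements(page_image):
    # stage 1 (placeholder): the model proposes layout elements with types + crops
    return [{"type": "table", "crop": page_image}, {"type": "formula", "crop": page_image}]

def parse_element(element):
    # stage 2 (placeholder): each crop is decoded into text / Markdown / LaTeX
    return {"type": element["type"], "content": "..."}

def parse_page(page_image):
    elements = detect_layout_elements(page_image)   # first detect elements in the layout
    with ThreadPoolExecutor() as pool:              # then parse each element in parallel
        return list(pool.map(parse_element, elements))

print(parse_page(page_image="page.png"))
```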
stop building parser pipelines 👇🏻 there's a new document parser that is small, fast, Apache 2.0 licensed, and better than all the other ones! 😱
echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables etc.) in a document 🤗
> the authors show in the paper that document parsing pipelines often propagate errors between stages
> single e2e models do better, but they're usually too heavy to use
this model addresses both: it's lighter, faster, stronger 🔥
> based on ViT, comes in different sizes (L/G/H) and resolutions (256/384)
> 0-day support in 🤗 transformers
> comes with physical reasoning (from video) benchmarks: MVPBench, IntPhys 2, and CausalVQA, plus a leaderboard: facebook/physical_reasoning_leaderboard
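assuming this is the V-JEPA 2 family, a hedged sketch of loading it through 🤗 transformers would look roughly like this (the checkpoint id is my guess, check the facebook org on the Hub for the real names)

```python
# hedged sketch: checkpoint id is an assumption, API follows the transformers video-model classes
import torch
from transformers import AutoModel, AutoVideoProcessor

ckpt = "facebook/vjepa2-vitl-fc64-256"   # assumed repo id
processor = AutoVideoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

video = torch.randint(0, 255, (64, 3, 256, 256), dtype=torch.uint8)  # dummy (frames, C, H, W) clip
inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state   # patch-level video features for downstream heads
```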
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.
Key capabilities:
- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)
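As a side note, the parameter-count capability maps to something you can reproduce yourself with huggingface_hub; here's a sketch of the idea (not the MCP server's own code):

```python
# sketch: parameter counts from safetensors metadata, no weights downloaded
from huggingface_hub import get_safetensors_metadata

meta = get_safetensors_metadata("Qwen/Qwen2.5-VL-3B-Instruct")
total = sum(meta.parameter_count.values())   # parameter_count maps dtype -> number of params
print(f"~{total / 1e9:.2f}B parameters")
```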
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it 🥹
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on average across benchmarks, based on Qwen/Qwen2.5-Omni-3B 🗣️
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below) and it's super competitive ⏯️ based on Qwen/Qwen2.5-Omni-7B
vision LMs are saturating benchmarks, so we built vibe eval 💬
> compare different models with refreshed in-the-wild examples in different categories 🤗
> submit your favorite model for eval
no numbers -- just vibes!