we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 👏🏻
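if you want to try it, here's a minimal sketch using the transformers Auto classes; the checkpoint name ETH-CVG/lightglue_superpoint is my assumption, check the Hub for the released repo:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# checkpoint name is an assumption, check the Hub for the released repo
ckpt = "ETH-CVG/lightglue_superpoint"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

image1 = Image.open("view_a.jpg")
image2 = Image.open("view_b.jpg")

# the processor takes the pair of images to be matched
inputs = processor([image1, image2], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # keypoints in both images + their matches
```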
🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with fewer tokens, supports long documents and videos 🎥 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)
🗣️ Audio
> Google released google/magenta-realtime, real-time music generation & audio synthesis (CC-BY-4.0)
> Kyutai released new speech-to-text models that come in 1B & 2B (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay
y'all have been asking my opinion on how OCR models compare to each other 👀 I will leave three apps by @prithivMLmods to compare the newest models instead ⤵️
> compare Nanonets-OCR-s, Qwen2-VL-OCR-2B-Instruct, RolmOCR, Aya-Vision: prithivMLmods/Multimodal-OCR
> SmolDocling, Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B: prithivMLmods/Multimodal-OCR2
> docscopeOCR, MonkeyOCR, coreOCR: prithivMLmods/core-OCR
so far I figured out:
> for fact-checks, you need a relatively bigger size (7B is ok!)
> Gemma 3 gets downgraded without pan & scan (especially for 📄)
> Qwen2.5VL-32B is very talkative, great for reasoning but not good for simple tasks 🗣️
the method is simple: find which tokens have the highest attention scores, merge the rest of the tokens based on their similarity, then merge both sets (see the sketch below)
their method works both training-free and with fine-tuning; the authors report a 5-point improvement on average across vision language tasks + an 8x speedup in prefilling time for LLaVA-NeXT 7B and 13B 🤯
removing redundant tokens improves image token quality too 🥹
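here's a rough sketch of the idea in PyTorch; this is my illustrative reading of the recipe, not the paper's actual code, and keep_ratio / n_clusters are made-up knobs:

```python
import torch
import torch.nn.functional as F

def reduce_visual_tokens(tokens, attn_scores, keep_ratio=0.25, n_clusters=16):
    """Illustrative sketch: keep high-attention tokens, merge the rest by similarity."""
    N = tokens.size(0)
    n_keep = max(1, int(N * keep_ratio))

    # 1. keep the tokens that receive the highest attention
    keep_idx = attn_scores.topk(n_keep).indices
    mask = torch.ones(N, dtype=torch.bool)
    mask[keep_idx] = False
    kept, rest = tokens[keep_idx], tokens[mask]

    # 2. merge the remaining tokens based on similarity:
    #    greedily average each one into its most similar centroid
    n_clusters = min(n_clusters, rest.size(0))
    centroids = rest[:n_clusters].clone()
    counts = torch.ones(n_clusters, 1)
    for tok in rest[n_clusters:]:
        j = F.cosine_similarity(tok[None], centroids, dim=-1).argmax()
        centroids[j] = (centroids[j] * counts[j] + tok) / (counts[j] + 1)
        counts[j] += 1

    # 3. merge both sets into one shorter sequence
    return torch.cat([kept, centroids], dim=0)

# e.g. 576 LLaVA-style image tokens -> 144 kept + 16 merged
tokens, scores = torch.randn(576, 1024), torch.rand(576)
reduced = reduce_visual_tokens(tokens, scores)  # (160, 1024)
```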
we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face 🔥 use them right away! it's where the community publishes optimized kernels 🤗
this release comes in three parts:
> Kernel Hub: contains (as of now) 14 kernels
> kernels: a Python library to load kernels from Kernel Hub
> kernel-builder: a Nix package to build kernels for PyTorch (made using the PyTorch C++ frontend)
when building models, your regular workflow should be pulling kernels from the Hub and building your model with them 🤗 here's a practical example with RMSNorm (sketched below):
1. pull the kernel from the Hub with get_kernel
2. decorate your layer with use_kernel_forward_from_hub
3. inject it into your model
we'd love to hear your feedback! 🙏🏻 we also welcome kernel contributions from the community 🥹💗
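a minimal sketch of that flow, assuming the kernels library; kernels-community/activation is one of the published repos, while the RMSNorm wiring below is illustrative:

```python
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull a kernel from the Hub with get_kernel
activation = get_kernel("kernels-community/activation")
x = torch.randn(8, 128, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)
activation.gelu_fast(out, x)  # call the optimized CUDA kernel directly

# 2. decorate your layer so its forward can be swapped for a Hub kernel
#    (the "RMSNorm" layer name here is illustrative)
@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # plain PyTorch fallback, used when no optimized kernel is loaded
        var = x.float().pow(2).mean(-1, keepdim=True)
        return self.weight * (x.float() * torch.rsqrt(var + self.eps)).type_as(x)

# 3. inject RMSNorm into your model like any other nn.Module
```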
Dolphin: new OCR model by ByteDance with MIT license 🐬
the model first detects elements in the layout (tables, formulas etc.) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
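a sketch of that two-stage flow with transformers, based on my reading of the model card; treat the exact prompt strings and generation settings as assumptions:

```python
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("ByteDance/Dolphin")
model = VisionEncoderDecoderModel.from_pretrained("ByteDance/Dolphin").eval()

image = Image.open("page.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

def ask(prompt: str) -> str:
    # Dolphin is prompt-controlled: the decoder is conditioned on a task prompt
    prompt_ids = processor.tokenizer(
        f"<s>{prompt} <Answer/>", add_special_tokens=False, return_tensors="pt"
    ).input_ids
    out = model.generate(pixel_values=pixel_values, decoder_input_ids=prompt_ids, max_length=4096)
    return processor.tokenizer.decode(out[0], skip_special_tokens=True)

# stage 1: detect layout elements and reading order
layout = ask("Parse the reading order of this document.")
# stage 2: parse each detected element (crop + element-level prompt), e.g.:
text = ask("Read text in the image.")
```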
stop building parser pipelines 👇🏻 there's a new document parser that is small, fast, Apache 2.0 licensed, and better than all the other ones! 😱
echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables etc.) in a document 🤗
> the authors show in the paper that document parsing pipelines often suffer from errors that propagate through their stages
> single end-to-end models do better, but they're too heavy to use
this model addresses both: it's lighter, faster, stronger 🔥
> based on ViT, comes in different sizes (L/G/H) and resolutions (256/384)
> 0-day support in 🤗 transformers
> comes with physical reasoning (from video) benchmarks: MVPBench, IntPhys 2, and CausalVQA facebook/physical_reasoning_leaderboard
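these descriptions match Meta's V-JEPA 2 release; assuming that, here's a minimal feature-extraction sketch with the transformers Auto classes (the checkpoint name is my guess, check the Hub for the released repos):

```python
import torch
from transformers import AutoVideoProcessor, AutoModel

# checkpoint name is an assumption, check the Hub for the released repos
ckpt = "facebook/vjepa2-vitl-fpc64-256"
processor = AutoVideoProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt).eval()

# dummy clip: (frames, channels, height, width)
video = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # per-patch video embeddings
```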
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.
Key capabilities:
- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata (see the sketch below)
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
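For instance, the parameter count analysis can be built on huggingface_hub's safetensors metadata endpoint; a minimal sketch (my server's actual implementation may differ):

```python
from huggingface_hub import get_safetensors_metadata

# read parameter counts from safetensors metadata, no weight download needed
meta = get_safetensors_metadata("Qwen/Qwen2.5-VL-3B-Instruct")
total = sum(meta.parameter_count.values())  # dict of dtype -> parameter count
print(f"{total / 1e9:.2f}B parameters")
```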
Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it 🥹
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on the average of benchmarks, based on Qwen/Qwen2.5-Omni-3B 🗣️
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below) and it's super competitive ⏯️ based on Qwen/Qwen2.5-Omni-7B