HF Canonical Model Maintainers

hf-maintainers's activity

merve posted an update about 19 hours ago
Release picks of the past week are here! Find more models, datasets, and Spaces here: merve/june-20-releases-68594824d1f4dfa61aee3433

🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with fewer tokens, supports long documents and videos (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-param OCR model based on Qwen2.5VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

🗣️ Audio
> Google released google/magenta-realtime, real-time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models in 1B & 2B sizes (kyutai/stt-1b-en_fr, kyutai/stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1, an image-to-3D model

merve posted an update 2 days ago

merve posted an update 4 days ago

merve posted an update 5 days ago
stop using VLMs blindly ✋🏻

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 visionLMsftw/comparevlms

> supports multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend new models or inputs and we'll add them 🫡

so far I've figured out:
> for fact-checking, you need a relatively bigger model (7B is ok!)
> Gemma 3 degrades without pan-and-scan (especially for documents 📑)
> Qwen2.5VL-32B is very talkative: great for reasoning but not great for simple tasks 🗣️

merve posted an update 6 days ago
Releases of the past week are here merve/releases-june-13-6852c3c1eaf1e0c24c958860

Here are our picks 🤓
So many interesting models were released in open AI this past week! 🤖

🖼️ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new sota video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)

Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, a new large (137B 🤯) audio language model that takes in audio and generates audio (OS)

๐Ÿค– Robotics
> NVIDIA released nvidia/GR00T-N1.5-3B, a new open foundation vision-language-action model

3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan by Tencent that can generate 3D assets from text and image prompts

merve posted an update 7 days ago
IN: video fine-tuning support for facebook V-JEPA 2 in HF transformers 🔥

it comes with
> four models fine-tuned on the Diving48 and SSv2 datasets facebook/v-jepa-2-6841bad8413014e185b497a6
> FastRTC demo on V-JEPA2 SSv2 qubvel-hf/vjepa2-streaming-video-classification
> fine-tuning script on UCF-101 https://gist.github.com/ariG23498/28bccc737c11d1692f6d0ad2a0d7cddb
> fine-tuning notebook on UCF-101 https://colab.research.google.com/drive/16NWUReXTJBRhsN3umqznX4yoZt2I7VGc?usp=sharing
we're looking forward to seeing what you build! 🤗

merve posted an update 8 days ago
#CVPR2025 Paper Picks #1
VisionZip is a compression technique that reduces the number of visual tokens to improve performance AND prefill time for vision language models
demo: Senqiao/VisionZip
paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (2412.04467)
most of the image tokens are redundant for the LLM, so the authors ask "are all visual tokens necessary?"

the method is simple:
keep the tokens that receive the highest attention scores, merge the remaining tokens based on similarity, then combine both sets
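a rough sketch of that keep-then-merge idea (not the authors' code; the shapes, the anchor choice, and the token budgets below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def compress_visual_tokens(vis_tokens, attn_received, keep=64, n_merged=16):
    """Keep the most-attended visual tokens, merge the rest by similarity.

    vis_tokens:    (N, D) visual token embeddings
    attn_received: (N,)   attention mass each visual token receives
    """
    N = vis_tokens.size(0)
    top_idx = attn_received.topk(keep).indices          # dominant tokens, kept as-is
    rest_mask = torch.ones(N, dtype=torch.bool)
    rest_mask[top_idx] = False
    kept = vis_tokens[top_idx]
    rest = vis_tokens[rest_mask]

    # merge the remaining tokens: assign each to its most similar anchor,
    # then average each group into a single "contextual" token
    anchors = rest[:n_merged]                           # naive anchor choice for the sketch
    sim = F.normalize(rest, dim=-1) @ F.normalize(anchors, dim=-1).T
    assign = sim.argmax(dim=-1)
    merged = [rest[assign == g].mean(dim=0) for g in range(n_merged) if (assign == g).any()]

    # compressed visual sequence: dominant tokens + merged contextual tokens
    return torch.cat([kept, torch.stack(merged)], dim=0)

# e.g. 576 image tokens -> roughly 64 + 16 = 80 tokens handed to the LLM
compressed = compress_visual_tokens(torch.randn(576, 1024), torch.rand(576))
```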

their method works both training-free and with fine-tuning
the authors report a 5-point improvement on average across vision language tasks + an 8x improvement in prefill time for Llava-Next 7B and 13B 🤯

removing redundant tokens improves image token quality too 🥹

merve posted an update 8 days ago
stop writing CUDA kernels yourself

we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face 🔥 use them right away!
it's where the community shares optimized kernels

this release comes in three parts
> Kernel Hub: contains (as of now) 14 kernels
> kernels: Python library to load kernels from Kernel Hub
> kernel-builder: Nix package to build kernels for PyTorch (made using PyTorch C++ frontend)

when building models, your regular workflow should be pulling kernels from the Hub and building your model with them 🤗
here's a practical example with RMSNorm:
1. pull the kernel from the Hub with get_kernel
2. decorate your layer with use_kernel_forward_from_hub
3. inject it into your model
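a minimal sketch of that flow, using the get_kernel / use_kernel_forward_from_hub names from this release (the Hub repo name, the kernel call hinted in the comment, and the reference RMSNorm are illustrative assumptions, not taken from the Kernel Hub docs):

```python
import torch
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull a kernel from the Hub and call it directly
#    (repo name assumed for illustration; downloading needs network access and a supported GPU)
activation = get_kernel("kernels-community/activation")
# e.g. activation.gelu_fast(out, inp) on CUDA tensors -- check the kernel's card for its exact API

# 2. decorate a layer so its forward() can be swapped for an optimized Hub kernel
@use_kernel_forward_from_hub("RMSNorm")
class RMSNorm(torch.nn.Module):
    """Reference RMSNorm; a matching Hub kernel can replace forward() when available."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.float().pow(2).mean(-1, keepdim=True)
        return self.weight * (x.float() * torch.rsqrt(variance + self.eps)).to(x.dtype)

# 3. build your model with the decorated layer as usual
#    (depending on the library version, an extra step may map decorated layers to Hub kernels)
layer = RMSNorm(4096)
out = layer(torch.randn(2, 16, 4096))
```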
we'd love to hear your feedback! 🙏🏻
we also welcome kernel contributions from the community 🥹💗

- request kernels here: kernels-community/README#1
- check out this org: kernels-community
- read the blog: https://huggingface.co/blog/hello-hf-kernels

merve posted an update 11 days ago
Dolphin: new OCR model by ByteDance with MIT license 🐬

the model first detects elements in the layout (tables, formulas, etc.) and then parses each element in parallel for generation (see the sketch below)
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
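a rough sketch of that detect-then-parse flow; detect_layout and parse_element below are hypothetical stand-ins for the model's two stages, not Dolphin's actual interface:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_page(page_image, detect_layout, parse_element, max_workers=8):
    """Stage 1: detect layout elements (tables, formulas, text blocks, ...).
    Stage 2: parse each detected element independently, so elements can run in parallel."""
    elements = detect_layout(page_image)  # e.g. [{"type": "table", "bbox": [...]}, ...]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parsed = pool.map(lambda el: parse_element(page_image, el), elements)
    return list(zip(elements, parsed))
```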

merve posted an update 13 days ago
stop building parser pipelines 👋🏻
there's a new document parser that is small, fast, Apache 2.0 licensed and better than all the others! 😱

echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables, etc.) in a document 🤠
> the authors show in the paper that document parsing pipelines often suffer from error propagation
> single end-to-end models do better, but they're too heavy to use

this model addresses both: it's lighter, faster, stronger 🔥

merve posted an update 13 days ago
Meta just released V-JEPA 2: new open-source image/video world models ⏯️🤗 facebook/v-jepa-2-6841bad8413014e185b497a6

> based on ViT, available in different sizes (L/G/H) and resolutions (286/384)
> 0-day support in 🤗 transformers
> comes with physical reasoning (from video) benchmarks: MVPBench, IntPhys 2, and CausalVQA facebook/physical_reasoning_leaderboard

Read more https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
We will release a fine-tuning notebook with task-specific models in transformers format soon, stay tuned!

davanstrien posted an update 15 days ago
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp

merve posted an update 19 days ago
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it 🥹
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on the benchmark average, based on Qwen/Qwen2.5-Omni-3B 🗣️
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding, and it's super competitive ⏯️ based on Qwen/Qwen2.5-Omni-7B

merve posted an update 20 days ago
Past week was insanely packed for open AI! 😱
Luckily we picked some highlights for you ❤️ lfg!

💬 LLMs/VLMs
> Deepseek 🐳 released deepseek-ai/DeepSeek-R1-0528, a 685B model only 0.2 and 1.4 points behind o3 on AIME 24/25 🤯; they also released an 8B distilled version based on Qwen3 (OS) deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d
> Xiaomi released MiMo-7B-RL (LLM for code and math) and MiMo-VL-7B-RL (VLM for visual reasoning, GUI agentic tasks and general use) (OS) XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212
> NVIDIA released nvidia/Nemotron-Research-Reasoning-Qwen-1.5B, a new reasoning model
> DS: MiniMax released https://huggingface.co/MiniMaxAI/SynLogic, a new dataset of 49k logical reasoning examples across 35 tasks, including ciphers, sudoku and more!

🖼️ Image/Video Generation
> Tencent released tencent/HunyuanPortrait, a new model for consistent portrait generation with an SVD Research license. They also released tencent/HunyuanVideo-Avatar, audio-driven avatar generation (OS)
> showlab released showlab/OmniConsistency, consistent stylization model (OS)
> Rapidata/text-2-video-human-preferences-veo3 is a new T2V preference dataset based on videos from Veo3 with 46k examples (OS)

Audio 🗣️
> https://huggingface.co/ResembleAI/Chatterbox is a new 500M text-to-speech model preferred over ElevenLabs (OS)
> PlayHT/PlayDiffusion is a new speech editing model (OS)

Other
> https://huggingface.co/NX-AI/TiReX is a new time series foundation model
> Yandex released a huge (4.79B examples!) video recommendation dataset https://huggingface.co/yandex/yambda

The (OS) ones have Apache 2.0 or MIT licenses; find more models and datasets here: merve/releases-30-may-6840097345e0b1e915bff843

merve posted an update 20 days ago
Yesterday was the day of vision language action models (VLAs)!

> SmolVLA: open-source small VLA for robotics by the Hugging Face LeRobot team 🤖
Blog: https://huggingface.co/blog/smolvla
Model: lerobot/smolvla_base

> Holo-1: 3B & 7B web/computer use agentic VLAs by H Company 💻
Model family: Hcompany/holo1-683dd1eece7eb077b96d0cbd
Demo: https://huggingface.co/spaces/multimodalart/Holo1
Blog: https://huggingface.co/blog/Hcompany/holo1
super exciting times!!

merve posted an update 21 days ago

merve posted an update 22 days ago

merve posted an update 23 days ago
New GUI model by Salesforce AI & Uni HK: Jedi
tianbaoxiexxx/Jedi xlangai/Jedi-7B-1080p 🤗
Based on Qwen2.5-VL with Apache 2.0 license

prompt with the screenshot below → select "find more"

merve posted an update 25 days ago
HOT: MiMo-VL, new 7B vision LMs by Xiaomi surpassing GPT-4o (Mar), competitive in GUI agentic + reasoning tasks ❤️‍🔥 XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212

not only that, but also MIT license & usable with transformers 🔥

merve posted an update 26 days ago
introducing: VLM vibe eval 🪭 https://huggingface.co/spaces/visionLMsftw/VLMVibeEval

vision LMs are saturated over benchmarks, so we built vibe eval 💬

> compare different models with refreshed in-the-wild examples in different categories 🤠
> submit your favorite model for eval
no numbers -- just vibes!