mv11's activity

merve posted an update 2 days ago
Dolphin: a new OCR model by ByteDance with an MIT license

The model first detects elements in the layout (tables, formulas, etc.), then parses each element in parallel for generation.
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
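The detect-then-parse design described above can be sketched as a two-stage flow. This is a hypothetical illustration, not Dolphin's actual API: the element types and the parse stub are placeholders, and only the structure (detect once, then parse elements concurrently) reflects the post.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page: str) -> list[dict]:
    # Stage 1 in Dolphin localizes typed elements (tables, formulas, text).
    # Faked here with a fixed result for illustration.
    return [
        {"type": "table", "content": "|a|b|"},
        {"type": "formula", "content": "E=mc^2"},
        {"type": "text", "content": "hello"},
    ]

def parse_element(element: dict) -> str:
    # Stage 2: each element is parsed independently of the others,
    # which is what makes parallel generation possible.
    return f"{element['type']}: {element['content']}"

def parse_page(page: str) -> list[str]:
    elements = detect_layout(page)
    # Elements have no dependency on each other, so fan them out.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(parse_element, elements))
```

The key property is that stage 2 is embarrassingly parallel once stage 1 has produced the element list.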
merve posted an update 4 days ago
stop building parser pipelines!
there's a new document parser that is small, fast, Apache 2.0 licensed, and better than all the others!

echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables, etc.) in a document
> the authors show in the paper that errors in document parsing pipelines propagate from stage to stage
> single end-to-end models avoid this, but they're usually too heavy to deploy

this model addresses both: it's lighter, faster, stronger
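The error-propagation point can be made concrete with a back-of-the-envelope calculation: if each of a pipeline's k stages is independently correct with probability p, the whole pipeline is correct with probability p^k, so a single end-to-end model can win even without a perfect per-step edge. The numbers below are illustrative, not from the paper.

```python
def pipeline_accuracy(stage_accuracy: float, num_stages: int) -> float:
    """End-to-end accuracy when stage errors compound independently."""
    return stage_accuracy ** num_stages

# A 4-stage pipeline (e.g. layout -> OCR -> table parse -> reading order),
# each stage 95% accurate, is right end-to-end only ~81% of the time,
# so an end-to-end model at ~90% per document already beats it.
four_stage = pipeline_accuracy(0.95, 4)
```

The independence assumption is optimistic for pipelines (upstream errors often make downstream stages worse, not just additive), which only strengthens the argument.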
merve posted an update 4 days ago
Meta just released V-JEPA 2: new open-source image/video world models facebook/v-jepa-2-6841bad8413014e185b497a6

> based on ViT, in different sizes (L/H/G) and resolutions (256/384)
> 0-day support in transformers
> comes with physical reasoning (from video) benchmarks: MVPBench, IntPhys 2, and CausalVQA facebook/physical_reasoning_leaderboard

Read more https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
We will release a fine-tuning notebook with task-specific models in transformers format soon, stay tuned!
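A rough sketch of what the 0-day transformers support looks like; the checkpoint id below is my assumption from the release pattern, not a verified name, so treat this as a hedged example rather than the official snippet. The token-count helper shows where sequence length comes from under an assumed V-JEPA-style patchification (2-frame tubelets, 16×16 spatial patches).

```python
def num_video_tokens(frames: int, resolution: int,
                     tubelet: int = 2, patch: int = 16) -> int:
    """Token count for a video encoder that patchifies 2x16x16 tubelets
    (assumed V-JEPA-style layout; defaults are illustrative)."""
    return (frames // tubelet) * (resolution // patch) ** 2

def extract_features(video):
    # Hedged sketch; `video` is a (frames, channels, height, width) tensor.
    from transformers import AutoModel, AutoVideoProcessor
    repo = "facebook/vjepa2-vitl-fp16-res256"  # assumed checkpoint name
    processor = AutoVideoProcessor.from_pretrained(repo)
    model = AutoModel.from_pretrained(repo)
    inputs = processor(video, return_tensors="pt")
    # The encoder returns one embedding per spatio-temporal token.
    return model(**inputs).last_hidden_state
```

At 16 frames and 256px this gives 8 × 16² = 2048 tokens, which is why the higher-resolution variants are noticeably heavier.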
merve posted an update 10 days ago
Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on average across benchmarks, based on Qwen/Qwen2.5-Omni-3B
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding, and it's super competitive; based on Qwen/Qwen2.5-Omni-7B
merve posted an update 11 days ago
Past week was insanely packed for open AI!
Luckily we picked some highlights for you, lfg!

LLMs/VLMs
> Deepseek released deepseek-ai/DeepSeek-R1-0528 (a 671B MoE, 37B active), only 0.2 and 1.4 points behind o3 in AIME 24/25; they also released an 8B distilled version based on Qwen3 (OS) deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d
> Xiaomi released MiMo-7B-RL (an LLM for code and math) and MiMo-VL-7B-RL (a VLM for visual reasoning, GUI agentic tasks and general use) (OS) XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212
> NVIDIA released nvidia/Nemotron-Research-Reasoning-Qwen-1.5B, a new reasoning model
> Dataset: MiniMax released https://huggingface.co/MiniMaxAI/SynLogic, 49k logical reasoning examples across 35 tasks, including cipher solving, sudoku and more!

๐Ÿ–ผ๏ธ Image/Video Generation
> tencent released tencent/HunyuanPortrait, a new model for consistent portrait generation with SVD Research license. They also released tencent/HunyuanVideo-Avatar, audio driven avatar generation (OS)
> showlab released showlab/OmniConsistency, consistent stylization model (OS)
> Rapidata/text-2-video-human-preferences-veo3 is a new T2V preference dataset based on videos from Veo3 with 46k examples (OS)

Audio๐Ÿ—ฃ๏ธ
> https://huggingface.co/ResembleAI/Chatterbox is a new 500M text-to-speech model preferred more than ElevenLabs (OS) ๐Ÿ˜
> PlayHT/PlayDiffusion is a new speech editing model (OS)

Other
> https://huggingface.co/NX-AI/TiReX is a new time series foundation model
> Yandex released a huge (4.79B examples!) video recommendation dataset https://huggingface.co/yandex/yambda

The (OS) ones have Apache 2.0 or MIT licenses; find more models and datasets here: merve/releases-30-may-6840097345e0b1e915bff843
merve posted an update 11 days ago
Yesterday was the day of vision language action models (VLAs)!

> SmolVLA: an open-source small VLA for robotics by the Hugging Face LeRobot team
Blog: https://huggingface.co/blog/smolvla
Model: lerobot/smolvla_base

> Holo-1: 3B & 7B web/computer-use agentic VLAs by H Company
Model family: Hcompany/holo1-683dd1eece7eb077b96d0cbd
Demo: https://huggingface.co/spaces/multimodalart/Holo1
Blog: https://huggingface.co/blog/Hcompany/holo1
super exciting times!!
merve posted an update 14 days ago
New GUI model by Salesforce AI & the University of Hong Kong: Jedi
tianbaoxiexxx/Jedi xlangai/Jedi-7B-1080p
Based on Qwen2.5-VL, with an Apache 2.0 license

prompt with the screenshot below → select "find more"
merve posted an update 16 days ago
HOT: MiMo-VL, new 7B vision LMs by Xiaomi surpassing GPT-4o (March), competitive in GUI agentic + reasoning tasks: XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212

not only that, but also MIT licensed & usable with transformers
merve posted an update 17 days ago
introducing: VLM vibe eval visionLMsftw/VLMVibeEval

vision LMs are saturating benchmarks, so we built a vibe eval

> compare different models on refreshed in-the-wild examples across different categories
> submit your favorite model for eval
no numbers -- just vibes!
merve posted an update 19 days ago
emerging trend: models that can understand image + text and generate image + text

don't miss out!
> MMaDA: a single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: a 7B MoT model by ByteDance, based on Qwen2.5, SigLIP-so-400M and the Flux VAE ByteDance-Seed/BAGEL

I keep track of all any-input → any-output models here: https://huggingface.co/collections/merve/any-to-any-models-6822042ee8eb7fb5e38f9b62
merve posted an update 20 days ago
what happened in open AI this past week? so many vision LM & omni releases! merve/releases-23-may-68343cb970bbc359f9b5fb05

multimodal
> the new moondream (VLM) is out: a 4-bit quantized (QAT) version of moondream-2b that runs in 2.5GB of VRAM at 184 tokens/s with only a 0.6% drop in accuracy (OS)
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDA is a new 8B diffusion language model that can generate image and text

LLMs
> Mistral released Devstral, a 24B coding assistant (OS)
> Fairy R1-32B is a new reasoning model, a distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released AceReason-Nemotron-14B, a new 14B math and code reasoning model
> sarvam-m is a new Indic LM with a hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)

image generation
> MTVCrafter is a new human motion animation generator
merve posted an update 24 days ago
Google released MedGemma at I/O '25: google/medgemma-release-680aade845f90bec6a3f60c4

> 4B and 27B instruction fine-tuned vision LMs, plus a 4B pre-trained vision LM, for medicine
> available with transformers from the get-go

they also released a cool demo for scan reading: google/rad_explain

use with transformers
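The original post ended with a transformers snippet that didn't survive scraping; here is a minimal sketch along those lines, assuming the `image-text-to-text` pipeline task and the `google/medgemma-4b-it` checkpoint (gated on the Hub, so it needs an authorized token).

```python
def build_messages(image_url: str, question: str) -> list[dict]:
    """Chat-format payload pairing an image with a text question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]

def describe_scan(image_url: str, question: str) -> str:
    # Hedged sketch: task name and checkpoint are assumed from the release.
    from transformers import pipeline
    pipe = pipeline("image-text-to-text", model="google/medgemma-4b-it")
    out = pipe(text=build_messages(image_url, question), max_new_tokens=128)
    # The pipeline returns the full chat; the answer is the last turn.
    return out[0]["generated_text"][-1]["content"]
```

The same message-building pattern works for the 27B instruct variant by swapping the model id.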
merve posted an update 24 days ago
You can translate this post
merve posted an update 24 days ago
2389
tis the year of any-to-any/omni models
ByteDance-Seed/BAGEL-7B-MoT is a 7B native multimodal model that understands and generates both image + text

it outperforms leading VLMs like Qwen2.5-VL, and has an Apache 2.0 license
joaogante posted an update 25 days ago
Let's go! Custom generation code has landed in transformers!

Have you designed a new cool KV cache? Maybe you're comparing new test-time compute ideas you've been researching? Have you found a way to do diffusion with existing models? You can now easily share your findings with the community with custom generation code, through the well-known generate interface.

In a nutshell, we have expanded the support of custom modeling code on the Hub with *model-agnostic* custom generation code. Write for one model, reuse with any model -- hopefully, this will democratize access to new generation ideas!

As a creator, you gain the ability to get your ideas into transformers with minimal effort. You'll also have access to all Hub features: a landing page for your creation, discussions, usage metrics, ...

Resources
- docs: https://huggingface.co/docs/transformers/generation_strategies#custom-decoding-methods
- minimal example: transformers-community/custom_generate_example
- discussion: transformers-community/support#10
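Putting the pieces above together, usage looks roughly like this; the base model here is an arbitrary small choice on my part (the feature is model-agnostic), and the `custom_generate` repo is the minimal example linked above.

```python
def custom_generate_kwargs(repo: str) -> dict:
    """Kwargs that route generate() through a Hub-hosted decoding loop."""
    return {"custom_generate": repo, "trust_remote_code": True}

def run_custom_generate(prompt: str) -> str:
    # Hedged sketch: downloads both the base model and the custom
    # generation code from the Hub on first call.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    base = "Qwen/Qwen2.5-0.5B-Instruct"  # any causal LM works here
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        **custom_generate_kwargs("transformers-community/custom_generate_example"),
    )
    return tok.decode(out[0], skip_special_tokens=True)
```

Because the decoding loop lives in its own repo, swapping `base` for another model reuses the same custom strategy unchanged.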
merve posted an update 26 days ago
NVIDIA released a new vision reasoning model for robotics: Cosmos-Reason1-7B nvidia/cosmos-reason1-67c9e926206426008f1da1b7

> the first reasoning model for robotics
> based on Qwen2.5-VL-7B; use with Hugging Face transformers or vLLM
> comes with SFT & alignment datasets and a new benchmark
merve posted an update 27 days ago
It was the week of video generation at @huggingface, on top of many new LLMs, VLMs and more!
Let's have a wrap: merve/may-16-releases-682aeed23b97eb0fe965345c

LLMs
> Alibaba Qwen released WorldPM-72B, a new world preference model trained with 15M preference samples (OS)
> II-Medical-8B, a new 8B LLM for medical reasoning by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)

Multimodal
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output. It's based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio and text (interleaved)

Image/Video Generation
> Alibaba Wan-AI released Wan2.1-VACE, a video foundation model for image- and text-to-video, video editing and more tasks; comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, a new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (AirBnB-style assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text- and image-to-video) model by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata, with 195k human responses from 38k annotators

Audio
> stabilityai released stable-audio-open-small, a new text-to-audio model
> TEN-framework released ten-vad, a voice activity detection model (OS)

merve posted an update about 1 month ago
New SOTA open-source depth estimation: Marigold v1-1

> normal maps and depth maps of scenes & faces: prs-eth/marigold-normals prs-eth/marigold
> get albedo (true color) and BRDF (texture) maps of scenes: prs-eth/marigold-intrinsics
> they even released a depth-to-3D-printer-format demo: prs-eth/depth-to-3d-print

All models are here: prs-eth/marigold-computer-vision-6669e9e3d3ee30f48214b9ba
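For depth specifically, Marigold ships as a diffusers pipeline; a hedged sketch follows (the pipeline class exists in diffusers, but the exact v1-1 checkpoint id here is my assumption). The helper shows the usual post-processing step of rescaling raw depth into [0, 1] for visualization.

```python
def normalize_depth(depth: list[float]) -> list[float]:
    """Affine-rescale raw depth values into [0, 1] for visualization."""
    lo, hi = min(depth), max(depth)
    if hi == lo:
        return [0.0 for _ in depth]
    return [(d - lo) / (hi - lo) for d in depth]

def estimate_depth(image):
    # Hedged sketch: checkpoint id assumed from the v1-1 release naming.
    import diffusers
    pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
        "prs-eth/marigold-depth-v1-1"
    )
    # Returns a dense per-pixel depth prediction for the input image.
    return pipe(image).prediction
```

The normals and intrinsics checkpoints follow the same pattern with their corresponding pipeline classes.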