AI & ML interests

Collection of JS libraries to interact with the Hugging Face Hub

Recent Activity

huggingfacejs's activity

merve posted an update about 5 hours ago
merve posted an update 1 day ago
New GUI model by Salesforce AI & Uni HK: Jedi
tianbaoxiexxx/Jedi xlangai/Jedi-7B-1080p 🤗
Based on Qwen2.5-VL with Apache 2.0 license

prompt with the screenshot below → select "find more"
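a minimal transformers sketch of that flow (assuming Jedi keeps the standard Qwen2.5-VL chat interface; the screenshot path and instruction are placeholders):

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Jedi is Qwen2.5-VL-based, so the standard Qwen2.5-VL classes should load it
model_id = "xlangai/Jedi-7B-1080p"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# placeholder screenshot + instruction, mirroring the "find more" example above
image = Image.open("screenshot.png")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Select the 'find more' button."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```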
merve posted an update 3 days ago
HOT: MiMo-VL, new 7B vision LMs by Xiaomi surpassing GPT-4o (March), competitive in GUI agentic + reasoning tasks ❤️‍🔥 XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212

not only that, but it also has an MIT license & is usable with transformers 🔥
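a rough sketch of loading it through the image-text-to-text pipeline (the exact checkpoint id from the collection is an assumption here):

```python
from transformers import pipeline

# checkpoint id below is an assumption; pick the exact one from the collection
pipe = pipeline(
    "image-text-to-text",
    model="XiaomiMiMo/MiMo-VL-7B-RL",
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": [
        # any image URL or local path works here
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image and reason about what is happening."},
    ]}
]

out = pipe(text=messages, max_new_tokens=256, return_full_text=False)
print(out[0]["generated_text"])
```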
merve posted an update 4 days ago
introducing: VLM vibe eval 🪭 visionLMsftw/VLMVibeEval

vision LMs have saturated existing benchmarks, so we built a vibe eval 💬

> compare different models on refreshed in-the-wild examples across different categories 🤠
> submit your favorite model for eval
no numbers -- just vibes!
merve posted an update 6 days ago
emerging trend: models that can understand image + text and generate image + text

don't miss out ⤵️
> MMaDA: single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: 7B MoT model based on Qwen2.5, SigLIP-so-400M, Flux VAE ByteDance-Seed/BAGEL
both by ByteDance! 😱

I keep track of all any-input → any-output models here: https://huggingface.co/collections/merve/any-to-any-models-6822042ee8eb7fb5e38f9b62
merve posted an update 7 days ago
what happened in open-source AI this past week? so many vision LM & omni releases 🔥 merve/releases-23-may-68343cb970bbc359f9b5fb05

multimodal 💬🖼️
> new moondream (VLM) is out: it's a 4-bit quantized (QAT) version of moondream-2b, runs in 2.5GB VRAM at 184 tok/s with only a 0.6% drop in accuracy (OS) 🌚
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM 🐬 (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance

> MMaDA is a new 8B diffusion language model that can generate image and text

LLMs
> Mistral released Devstral, a 24B coding assistant (OS) 👩🏻‍💻
> Fairy R1-32B is a new reasoning model -- distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released ACEReason-Nemotron-14B, new 14B math and code reasoning model
> sarvam-m is a new Indic LM with hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)

image generation 🎨
> MTVCrafter is a new human motion animation generator
merve posted an update 11 days ago
Google released MedGemma at I/O '25 👏 google/medgemma-release-680aade845f90bec6a3f60c4

> 4B and 27B instruction fine-tuned vision LMs and a 4B pre-trained vision LM for medicine
> available with transformers from the get-go 🤗

they also released a cool demo for scan reading ➡️ google/rad_explain

use with transformers ⤵️
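a minimal sketch of that transformers usage (the 4B instruction-tuned checkpoint id and the scan image are assumptions):

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# 4B instruction-tuned checkpoint from the MedGemma collection (id assumed)
model_id = "google/medgemma-4b-it"
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# placeholder image path; the scan goes in as a regular image
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": Image.open("chest_xray.png")},
        {"type": "text", "text": "Describe the findings in this chest X-ray."},
    ]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```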
merve posted an update 11 days ago
You can translate this post 🤗💗
merve posted an update 11 days ago
'tis the year of any-to-any/omni models 🤠
ByteDance-Seed/BAGEL-7B-MoT is a 7B native multimodal model that understands and generates both image + text

it outperforms leading VLMs like Qwen2.5-VL 👏 and has an Apache 2.0 license 😱
merve in huggingfacejs/tasks 12 days ago

image-to-video

#7 opened 13 days ago by multimodalart
merve posted an update 13 days ago
NVIDIA released a new vision reasoning model for robotics: Cosmos-Reason1-7B 🤖 nvidia/cosmos-reason1-67c9e926206426008f1da1b7

> first reasoning model for robotics
> based on Qwen2.5-VL-7B, use with Hugging Face transformers or vLLM 🤗 (rough sketch below)
> comes with SFT & alignment datasets and a new benchmark 👏
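for the vLLM route, a rough sketch (the checkpoint id and image URL are placeholders; vLLM's chat API takes OpenAI-style image_url content for Qwen2.5-VL-style models):

```python
from vllm import LLM, SamplingParams

# checkpoint id assumed from the Cosmos-Reason1 collection linked above
llm = LLM(model="nvidia/Cosmos-Reason1-7B", limit_mm_per_prompt={"image": 1})
sampling = SamplingParams(temperature=0.6, max_tokens=512)

messages = [
    {"role": "user", "content": [
        # placeholder image of the scene the robot is looking at
        {"type": "image_url",
         "image_url": {"url": "https://example.com/robot_workspace.jpg"}},
        {"type": "text",
         "text": "Is it safe for the robot arm to grasp the mug right now? Reason step by step."},
    ]}
]

outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```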
merve posted an update 14 days ago
It was the week of video generation at @huggingface, on top of many new LLMs, VLMs and more!
Let's have a wrap 🌯 merve/may-16-releases-682aeed23b97eb0fe965345c

LLMs 💬
> Alibaba Qwen released WorldPM-72B, a new World Preference Model trained with 15M preference samples (OS)
> II-Medical-8B, a new 8B LLM for medical reasoning by Intelligent-Internet
> TRAIL is a new dataset by Patronus for trace error reasoning for agents (OS)

Multimodal 🖼️💬
> Salesforce Research released BLIP3o, a new any-to-any model with image-text input and image-text output 💬 it's based on an image encoder, a text decoder and a DiT, and comes in 8B
> They also released pre-training and fine-tuning datasets
> MMMG is a multimodal generation benchmark for image, audio, text (interleaved)

Image Generation ⏯️
> Alibaba Wan-AI released Wan2.1-VACE, a video foundation model for image- and text-to-video, video-to-audio and more tasks; comes in 1.3B and 14B (OS)
> ZuluVision released MoviiGen1.1, a new cinematic video generation model based on Wan 2.1 14B (OS)
> multimodalart released isometric-skeumorphic-3d-bnb, an isometric 3D asset generator (like Airbnb assets) based on Flux
> LTX-Video-0.9.7-distilled is a new real-time video generation (text- and image-to-video) model by Lightricks
> Hidream_t2i_human_preference is a new text-to-image preference dataset by Rapidata with 195k human responses from 38k annotators

Audio 🗣️
> stabilityai released stable-audio-open-small, a new text-to-audio model
> TEN-framework released ten-vad, a voice activity detection model (OS)

merve posted an update 17 days ago
New SOTA open-source depth estimation: Marigold v1-1 🌼

> normal maps, depth maps of scenes & faces prs-eth/marigold-normals prs-eth/marigold
> get albedo (true color) and BRDF (texture) maps of scenes prs-eth/marigold-intrinsics
> they even released a depth-to-3D-printer-format demo 😮 prs-eth/depth-to-3d-print

All models are here: prs-eth/marigold-computer-vision-6669e9e3d3ee30f48214b9ba
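a minimal sketch with the diffusers Marigold pipeline (the v1-1 depth checkpoint id and the input image are assumptions):

```python
import torch
import diffusers
from diffusers.utils import load_image

# depth checkpoint id for the v1-1 release is an assumption; see the collection above
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-v1-1", torch_dtype=torch.float16
).to("cuda")

# placeholder input image (any RGB photo of a scene or face)
image = load_image("scene.jpg")
depth = pipe(image)

# colorize the predicted depth map and save it
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("scene_depth.png")
```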
merve posted an update 21 days ago
VLMs 2025 UPDATE 🔥

We just shipped a blog post covering all the latest on vision language models, including
🤖 GUI agents, agentic VLMs, omni models
📑 multimodal RAG
⏯️ video LMs
🤏🏻 smol models
...and more! https://huggingface.co/blog/vlms-2025
merve posted an update 27 days ago
A ton of impactful models and datasets landed in open-source AI this past week; let's summarize the best 🤩 merve/releases-apr-21-and-may-2-6819dcc84da4190620f448a3

💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
🖼️ > ByteDance released UI-TARS-1.5, a native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation and conversation model
👩🏻‍💻 JetBrains released Mellum models in base and SFT variants for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩
merve posted an update 28 days ago
A real-time object detector that is much faster and more accurate than YOLO, with an Apache 2.0 license, just landed in Hugging Face transformers 🔥

D-FINE is a SOTA real-time object detector that runs on a T4 (free Colab) 🤩

> Collection with all checkpoints and demo ustc-community/d-fine-68109b427cbe6ee36b4e7352

Notebooks:
> Tracking https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
> Inference https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
> Fine-tuning https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper 🎩

Regular object detectors try to predict bounding boxes as exact (x, y, w, h) pixel coordinates, which is a very rigid target and hard to optimize 🥲☹️

D-FINE instead formulates object detection as predicting a distribution over bounding box coordinates and refines it iteratively, which makes it more accurate 🤩

Another core idea behind this model is Global Optimal Localization Self-Distillation ⤵️

the model uses the final layer's distribution output (sort of like a teacher) and distills it into earlier layers to make them more performant.
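a minimal transformers inference sketch (the checkpoint id under ustc-community is an assumption; see the notebooks above for tracking and fine-tuning):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForObjectDetection

# checkpoint id assumed from the ustc-community D-FINE collection above
model_id = "ustc-community/dfine-medium-coco"
image_processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForObjectDetection.from_pretrained(model_id)

# placeholder input image
image = Image.open("street.jpg")
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# map predictions back to the original image size and keep confident detections
results = image_processor.post_process_object_detection(
    outputs, target_sizes=[(image.height, image.width)], threshold=0.5
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{model.config.id2label[label.item()]}: {score.item():.2f} {box.tolist()}")
```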

merve posted an update about 1 month ago