Activity Feed

AI & ML interests

Request to join this organization to beta-test notebooks on Hugging Face!

Recent Activity

merveΒ 
posted an update about 19 hours ago
view post
Post
502
past week had huuuge releases πŸ’—
here's our picks πŸ”₯ find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new sota LLM with 1T total 32B active parameters 🀯

> HuggingFaceTB/SmolLM3-3B is the new best LM for it's size, offers thinking mode πŸ’­ as well as the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA
merveΒ 
posted an update 6 days ago
view post
Post
3000
GitHub refuses to render notebooks for a long time now πŸ’”

so smol-vision now lives in Hugging Face model repository πŸ€— merve/smol-vision
  • 1 reply
Β·
merveΒ 
posted an update 7 days ago
view post
Post
3355
ByteDance released Tar 1.5B and 7B: image-text in image-text out models, fully open-source πŸ‘ ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion)
The model is actually a full LLM (Qwen2), the tokenizer converts image tokens 🀯
chansungΒ 
posted an update 7 days ago
view post
Post
3428
YAML engineering becomes more and more important than ever from infra provisioning to model training (recipes).

Here, I built a simple editor first for @dstackai , and I will share the live endpoint this week. Let me know what you think about this approach.

Based on this approach, if people think this is useful, I am going to do the same thing for the LLM training recipes for popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me hear.
merveΒ 
posted an update 8 days ago
view post
Post
3616
Huge drops in open AI past week!
Find more models, datasets, demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫑
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
πŸ§‘πŸ»β€πŸ’» apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
πŸ—£οΈ kyutai/tts-1.6b-en_fr is a new small TTS model for English and France
πŸ‘€ aharley/alltracker is a new pixel tracking model by Stanford, demo is here aharley/alltracker
πŸ“– racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset
  • 1 reply
Β·
merveΒ 
posted an update 13 days ago
view post
Post
908
SOOOO MANY MODEL RELEASES 😍
Here's some picks from past week πŸ€—

> ByteDance/XVerse is a new identity preserving image generation model πŸ–ΌοΈ
> google/gemma-3n-E4B-it, any-to-text model supported by transformers πŸ€—
> nvidia/llama-nemoretriever-colembed-3b-v1 two new state-of-the-art visual document retrievers πŸ“‘
> New version of Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c
merveΒ 
posted an update 13 days ago
merveΒ 
posted an update 15 days ago
merveΒ 
posted an update 19 days ago
view post
Post
593
Dataset Viewer for PDFs just landed on Hugging Face πŸ“–πŸ€— you can now preview all the PDFs easier than before!

on top of this, there's PdfFolder format to load the PDF datasets quicker πŸ’¨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc1.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🀝

read document dataset docs https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here https://huggingface.co/datasets?modality=modality:document&sort=trending πŸ“–
  • 1 reply
Β·
merveΒ 
posted an update 21 days ago
view post
Post
639
we've merged LightGlue keypoint matcher to Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector πŸ™πŸ»

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ‡️
  • 1 reply
Β·
merveΒ 
posted an update 22 days ago
view post
Post
4332
Release picks of the past week is here! Find more models, datasets, Spaces here merve/june-20-releases-68594824d1f4dfa61aee3433

πŸ–ΌοΈ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with less tokens, supports long documents, videos πŸ‘ (OS)
> nanonets/Nanonets-OCR-s is 3.75B params OCR model based on Qwen2.5VL-3B-Instruct (OS)

πŸ’¬ LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

πŸ—£οΈ Audio
> Google released google/magenta-realtime, real time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models that come in 1B & 2B ( kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1 an image-to-3D model (see below)
merveΒ 
posted an update 23 days ago
merveΒ 
posted an update 25 days ago
merveΒ 
posted an update 26 days ago
view post
Post
1920
stop using VLMs blindly βœ‹πŸ»

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) πŸ”₯ visionLMsftw/comparevlms

> has support for multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend us new models or inputs, we'll add 🫑

so far I figured out
> for fact-checks, you need a relatively bigger size (7B is ok!)
> Gemma 3 gets downgrade without pan and scan (especially for πŸ“‘)
> Qwen2.5VL-32B is very talkative, great for reasoning but not good for simple tasks πŸ—£οΈ
  • 2 replies
Β·
merveΒ 
posted an update 27 days ago
view post
Post
3625
Releases of the past week are here merve/releases-june-13-6852c3c1eaf1e0c24c958860

Here's our picks πŸ€“
So many interesting models released past week in open AI! πŸ€–

πŸ–ΌοΈ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new sota video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)

Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, new large (137B 🀯) audio language model that takes in audio and generates audio (OS)

πŸ€– Robotics
> nvidia released nvidia/GR00T-N1.5-3B, new open foundation vision language action model

3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan by Tencent that can generate 3D assets from text and image prompts
merveΒ 
posted an update 28 days ago
view post
Post
3559
IN: video fine-tuning support for facebook V-JEPA 2 in HF transformers πŸ”₯

it comes with
> four models fine-tuned on Diving48 and SSv2 dataset facebook/v-jepa-2-6841bad8413014e185b497a6
> FastRTC demo on V-JEPA2 SSv2 qubvel-hf/vjepa2-streaming-video-classification
> fine-tuning script on UCF-101 https://gist.github.com/ariG23498/28bccc737c11d1692f6d0ad2a0d7cddb
> fine-tuning notebook on UCF-101 https://colab.research.google.com/drive/16NWUReXTJBRhsN3umqznX4yoZt2I7VGc?usp=sharing
we're looking forward to see what you will build! πŸ€—
merveΒ 
posted an update 29 days ago
view post
Post
2457
#CVPR2025 Paper Picks #1
VisionZip is a compression technique that reduces number of visual tokens to improve performance AND prefill time for vision language models
demo: Senqiao/VisionZip
paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (2412.04467)
most of the image tokens are redundant for the LLM, so the authors ask "are all visual tokens necessary?"

the method is simple:
find which tokens have the highest attention score, merge rest of the tokens based on similarity, then merge both

their method is both training-free and for fine-tuning
the authors report 5 point improvement on average of vision language tasks + 8x improvement in prefilling time for Llava-Next 7B and 13B 🀯

removing redundant tokens improve image token quality too πŸ₯Ή
merveΒ 
posted an update 29 days ago
view post
Post
3696
stop writing CUDA kernels yourself

we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face πŸ”₯ use them right away!
it's where the community populates optimized kernels 🀝

this release comes in three parts
> Kernel Hub: contains (as of now) 14 kernels
> kernels: Python library to load kernels from Kernel Hub
> kernel-builder: Nix package to build kernels for PyTorch (made using PyTorch C++ frontend)

when building models, your regular workflow should be pulling kernels from Hub and building your model with them πŸ€—
here's a practical example with RMSNorm:
1. pull the kernel from Hub with get_kernel
2. decorate with use_kernel_forward_from_hub
3. inject it to your model
we'd love to hear your feedback! πŸ™πŸ»
we also welcome kernel contributions by community πŸ₯ΉπŸ’—

- request kernels here: kernels-community/README#1
- check out this org: kernels-community
- read the blog: https://huggingface.co/blog/hello-hf-kernels
  • 1 reply
Β·
merveΒ 
posted an update about 1 month ago
view post
Post
725
Dolphin: new OCR model by ByteDance with MIT license 🐬

the model first detects element in the layout (table, formula etc) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
reach-vbΒ 
posted an update about 1 month ago
view post
Post
2806
Excited to onboard FeatherlessAI on Hugging Face as an Inference Provider - they bring a fleet of 6,700+ LLMs on-demand on the Hugging Face Hub 🀯

Starting today, you'd be able to access all those LLMs (OpenAI compatible) on HF model pages and via OpenAI client libraries too! πŸ’₯

Go, play with it today: https://huggingface.co/blog/inference-providers-featherless

P.S. They're also bringing on more GPUs to support all your concurrent requests!