AI & ML interests

None defined yet.

Recent Activity

merve posted an update about 15 hours ago
we've merged the LightGlue keypoint matcher into Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⬇️
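for the curious, a minimal sketch of what pair matching might look like in transformers; the checkpoint id and the post-processing helper below are my assumptions, the ETH-CVG/LightGlue model card has the canonical snippet:

```python
# a rough sketch, not the official usage: the checkpoint id and the
# post_process_keypoint_matching helper are assumptions
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "ETH-CVG/lightglue_superpoint"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

# two photos of the same place from different viewpoints
image_a, image_b = Image.open("view_a.jpg"), Image.open("view_b.jpg")
inputs = processor([image_a, image_b], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# recover matched keypoint pairs in original image coordinates
target_sizes = [[(image_a.height, image_a.width), (image_b.height, image_b.width)]]
matches = processor.post_process_keypoint_matching(outputs, target_sizes=target_sizes)
print(matches[0]["keypoints0"][:5], matches[0]["keypoints1"][:5])
```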
merve posted an update 1 day ago
Release picks of the past week are here! Find more models, datasets, and Spaces here: merve/june-20-releases-68594824d1f4dfa61aee3433

πŸ–ΌοΈ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM, 3B active params, smarter with less tokens, supports long documents, videos πŸ‘ (OS)
> nanonets/Nanonets-OCR-s is 3.75B params OCR model based on Qwen2.5VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their previous model with better function calling & instruction following (OS)

πŸ—£οΈ Audio
> Google released google/magenta-realtime, real time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models that come in 1B & 2B ( kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1, an image-to-3D model (see below)
merve posted an update 5 days ago
stop using VLMs blindly ✋🏻

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 visionLMsftw/comparevlms

> has support for multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend new models or inputs to us and we'll add them 🫡

so far I've figured out:
> for fact-checking, you need a relatively bigger size (7B is ok!)
> Gemma 3 gets downgraded without pan & scan (especially for 📑)
> Qwen2.5VL-32B is very talkative, great for reasoning but not good for simple tasks 🗣️
merve posted an update 7 days ago
Releases of the past week are here: merve/releases-june-13-6852c3c1eaf1e0c24c958860

Here are our picks 🤓
So many interesting models were released in open AI this past week! 🤖

πŸ–ΌοΈ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new sota video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)

Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, a new large (137B 🤯) audio language model that takes in audio and generates audio (OS)

🤖 Robotics
> NVIDIA released nvidia/GR00T-N1.5-3B, a new open foundation vision-language-action model

3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan3D by Tencent that can generate 3D assets from text and image prompts
merve posted an update 8 days ago
IN: video fine-tuning support for Facebook's V-JEPA 2 in HF transformers 🔥

it comes with:
> four models fine-tuned on the Diving48 and SSv2 datasets facebook/v-jepa-2-6841bad8413014e185b497a6
> a FastRTC demo of V-JEPA 2 on SSv2 qubvel-hf/vjepa2-streaming-video-classification
> a fine-tuning script for UCF-101 https://gist.github.com/ariG23498/28bccc737c11d1692f6d0ad2a0d7cddb
> a fine-tuning notebook for UCF-101 https://colab.research.google.com/drive/16NWUReXTJBRhsN3umqznX4yoZt2I7VGc?usp=sharing
we're looking forward to seeing what you will build! 🤗
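if you just want to try the fine-tuned checkpoints, a minimal classification sketch could look like this (the checkpoint id below is an assumption, grab the exact names from the collection above):

```python
# a minimal sketch, assuming an SSv2 video classification checkpoint;
# see the collection linked above for the exact model ids
import torch
from transformers import AutoVideoProcessor, AutoModelForVideoClassification

ckpt = "facebook/vjepa2-vitl-fpc16-256-ssv2"  # assumed checkpoint id
processor = AutoVideoProcessor.from_pretrained(ckpt)
model = AutoModelForVideoClassification.from_pretrained(ckpt)

# dummy clip: 16 RGB frames at 256x256; replace with real video frames
video = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```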
burtenshaw posted an update 8 days ago
You don't need remote APIs for a coding copilot, or for the MCP Course! Set up a fully local IDE with MCP integration using Continue. This tutorial guides you through setting it up.

This is what you need to do to take control of your copilot:

1. Get the Continue extension from the [VS Code marketplace](https://marketplace.visualstudio.com/items?itemName=Continue.continue) to serve as the AI coding assistant.

2. Serve the model with an OpenAI-compatible server using llama.cpp, LM Studio, etc.

```
llama-server -hf unsloth/Devstral-Small-2505-GGUF:Q4_K_M
```

3. Create a .continue/models/llama-max.yaml file in your project to tell Continue how to use the local model served by llama.cpp.

```yaml
name: Llama.cpp model
version: 0.0.1
schema: v1
models:
  - provider: llama.cpp
    model: unsloth/Devstral-Small-2505-GGUF
    apiBase: http://localhost:8080
    defaultCompletionOptions:
      contextLength: 8192 # adjust based on the model
    name: Llama.cpp Devstral-Small
    roles:
      - chat
      - edit
```


4. Create a .continue/mcpServers/playwright-mcp.yaml file to integrate a tool, like the Playwright browser automation tool, with your assistant.

```yaml
name: Playwright mcpServer
version: 0.0.1
schema: v1
mcpServers:
  - name: Browser search
    command: npx
    args:
      - "@playwright/mcp@latest"
```


Check out the full tutorial in the [MCP course](https://huggingface.co/learn/mcp-course/unit2/continue-client).
merve posted an update 8 days ago
#CVPR2025 Paper Picks #1
VisionZip is a compression technique that reduces the number of visual tokens to improve performance AND prefill time for vision language models
demo: Senqiao/VisionZip
paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (2412.04467)
most of the image tokens are redundant for the LLM, so the authors ask: "are all visual tokens necessary?"

the method is simple:
find which tokens get the highest attention scores, merge the rest of the tokens based on similarity, then feed both sets to the LLM (see the sketch below)

the method works both training-free and with fine-tuning
the authors report a 5-point improvement on average across vision language tasks + an 8x improvement in prefill time for LLaVA-NeXT 7B and 13B 🤯

removing redundant tokens improves image token quality too 🥹
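here's a toy version of that idea (my own simplification, not the authors' code):

```python
# a toy sketch of the VisionZip idea, not the paper's implementation:
# keep the most-attended ("dominant") visual tokens and compress the
# rest into a few merged ("contextual") tokens by similarity
import torch
import torch.nn.functional as F

def compress_visual_tokens(tokens, attn, keep=64, contextual=16):
    # tokens: (n, d) visual tokens; attn: (n,) attention each token receives
    top = attn.topk(keep).indices
    mask = torch.ones(tokens.size(0), dtype=torch.bool)
    mask[top] = False
    dominant, rest = tokens[top], tokens[mask]

    # use the first few leftover tokens as anchors, assign every leftover
    # token to its most similar anchor, then average each group
    anchors = rest[:contextual]
    sim = F.normalize(rest, dim=-1) @ F.normalize(anchors, dim=-1).T
    assign = sim.argmax(dim=-1)
    merged = torch.stack([rest[assign == j].mean(0) for j in range(contextual)])

    # the LLM now sees keep + contextual tokens instead of n
    return torch.cat([dominant, merged])

out = compress_visual_tokens(torch.randn(576, 1024), torch.rand(576))
print(out.shape)  # torch.Size([80, 1024])
```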
merve posted an update 9 days ago
stop writing CUDA kernels yourself

we have launched Kernel Hub: easy optimized kernels for all models on Hugging Face 🔥 use them right away!
it's where the community publishes optimized kernels 🤝

this release comes in three parts
> Kernel Hub: contains (as of now) 14 kernels
> kernels: a Python library to load kernels from the Kernel Hub
> kernel-builder: a Nix package to build kernels for PyTorch (made using the PyTorch C++ frontend)

when building models, your regular workflow should be pulling kernels from the Hub and building your model with them 🤗
here's a practical example with RMSNorm:
1. pull the kernel from the Hub with get_kernel
2. decorate your layer with use_kernel_forward_from_hub
3. inject it into your model
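roughly like this (a sketch under my own assumptions: the kernel repo id and the "RMSNorm" hub-side layer name are illustrative, not from the release notes):

```python
# a minimal sketch of the workflow, not an official snippet; the repo id
# and hub-side layer name are assumptions
import torch
import torch.nn as nn
from kernels import get_kernel, use_kernel_forward_from_hub

# 1. pull a kernel straight from the Hub
activation = get_kernel("kernels-community/activation")  # assumed repo id

# 2. decorate your layer so its forward can be swapped for a Hub kernel
@use_kernel_forward_from_hub("RMSNorm")  # assumed hub-side layer name
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

# 3. build your model with the decorated layer as usual; the optimized
# forward gets injected when a matching kernel is available
norm = RMSNorm(4096)
```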
we'd love to hear your feedback! 🙏🏻
we also welcome kernel contributions from the community 🥹💗

- request kernels here: kernels-community/README#1
- check out this org: kernels-community
- read the blog: https://huggingface.co/blog/hello-hf-kernels
merve posted an update 12 days ago
Dolphin: new OCR model by ByteDance with MIT license 🐬

the model first detects elements in the layout (tables, formulas etc.) and then parses each element in parallel for generation
Model: ByteDance/Dolphin
Try the demo: ByteDance/Dolphin
burtenshaw posted an update 13 days ago
Brand new MCP Course units are out, and now it's getting REAL! We've collaborated with Anthropic to dive deep into production-ready and autonomous agents using MCP

🔗 mcp-course

Here's what the new material covers:

- Use Claude Code to build an autonomous PR agent
- Integrate your agent with Slack and GitHub to bring it to your team
- Get certified on your use case and share it with the community
- Build an autonomous PR cleanup agent on the Hugging Face Hub and deploy it with Spaces

The material goes deep into these problems and helps you build applications that work. We're super excited to see what you build with it.
merve posted an update 13 days ago
stop building parser pipelines 👋🏻
there's a new document parser that is small, fast, Apache 2.0 licensed, and better than all the other ones! 😱

echo840/MonkeyOCR is a 3B model that can parse everything (charts, formulas, tables etc.) in a document 🤠
> the authors show in the paper that document parsing pipelines often suffer from error propagation
> single end-to-end models do better, but they're too heavy to use

this model addresses both: it's lighter, faster, stronger 🔥
merve posted an update 14 days ago
Meta just released V-JEPA 2: new open-source image/video world models ⏯️🤗 facebook/v-jepa-2-6841bad8413014e185b497a6

> based on ViT, different sizes (L/G/H) and resolutions (256/384)
> 0-day support in 🤗 transformers
> comes with physical reasoning (from video) benchmarks: MVPBench, IntPhys 2, and CausalVQA facebook/physical_reasoning_leaderboard

Read more https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
We will release a fine-tuning notebook with task-specific models in transformers format soon, stay tuned!
burtenshaw posted an update 14 days ago
Super excited to release AutoTrain MCP. This is an MCP server for training AI models, so you can use your AI tools to train your AI models 🤯.

🔗 burtenshaw/autotrain-mcp

Use this MCP server with tools like Claude Desktop, Cursor, VS Code, or Continue to do the following:

- Define an ML problem like image classification, LLM fine-tuning, text classification, etc.
- The AI can retrieve models and datasets from the Hub using the Hub MCP.
- Training happens on a Hugging Face Space, so no worries about hardware constraints.
- Models are pushed to the Hub to be used with inference tools like llama.cpp, vLLM, MLX, etc.
- Built on top of the AutoTrain library, so it has full integration with transformers and other libraries.

Everything is still under active development, but I'm super excited to hear what people build, and I'm open to contributions!
ariG23498 posted an update 21 days ago
🚨 Implement a KV cache from scratch in pure PyTorch 🚨

We have documented everything we learned while adding a KV cache to nanoVLM. Joint work with @kashif @lusxvr @andito @pcuenq

Blog: hf.co/blog/kv-cache
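the core trick, as a standalone toy (not the nanoVLM code): cache past keys/values so each decode step only computes the newest token's projections

```python
# a toy standalone sketch of a KV cache, not the nanoVLM implementation
import torch

class KVCache:
    def __init__(self):
        self.k = self.v = None

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, heads, new_tokens, head_dim)
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            # append the new step to the cached history
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

cache = KVCache()
head_dim = 64
for step in range(4):  # one new token per decode step
    q = torch.randn(1, 8, 1, head_dim)  # query for the newest token only
    k, v = cache.update(torch.randn(1, 8, 1, head_dim),
                        torch.randn(1, 8, 1, head_dim))
    # attention over the full cached history, computed for a single query
    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v
print(cache.k.shape)  # torch.Size([1, 8, 4, 64])
```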
merve posted an update 21 days ago
The past week was insanely packed for open AI! 😱
Luckily we picked some highlights for you ❤️ lfg!

💬 LLMs/VLMs
> DeepSeek 🐳 released deepseek-ai/DeepSeek-R1-0528 (37B active params), only 0.2 and 1.4 points behind o3 on AIME 24/25 🤯 they also released an 8B distilled version based on Qwen3 (OS) deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d
> Xiaomi released MiMo-7B-RL (an LLM for code and math) and MiMo-VL-7B-RL (a VLM for visual reasoning, GUI agentic tasks, and general use) (OS) 😍 XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212
> NVIDIA released nvidia/Nemotron-Research-Reasoning-Qwen-1.5B, a new reasoning model
> Dataset: MiniMax released https://huggingface.co/MiniMaxAI/SynLogic, 49k new logical reasoning examples across 35 tasks, including cipher solving, sudoku, and more!

πŸ–ΌοΈ Image/Video Generation
> tencent released tencent/HunyuanPortrait, a new model for consistent portrait generation with SVD Research license. They also released tencent/HunyuanVideo-Avatar, audio driven avatar generation (OS)
> showlab released showlab/OmniConsistency, consistent stylization model (OS)
> Rapidata/text-2-video-human-preferences-veo3 is a new T2V preference dataset based on videos from Veo3 with 46k examples (OS)

🗣️ Audio
> https://huggingface.co/ResembleAI/Chatterbox is a new 500M text-to-speech model preferred over ElevenLabs (OS) 😍
> PlayHT/PlayDiffusion is a new speech editing model (OS)

Other
> https://huggingface.co/NX-AI/TiReX is a new time series foundation model
> Yandex released a huge (4.79B examples!) video recommendation dataset https://huggingface.co/yandex/yambda

the OS ones have Apache 2.0 or MIT licenses; find more models and datasets here: merve/releases-30-may-6840097345e0b1e915bff843
merve posted an update 21 days ago
Yesterday was the day of vision language action models (VLAs)!

> SmolVLA: open-source small VLA for robotics by the Hugging Face LeRobot team 🤖
Blog: https://huggingface.co/blog/smolvla
Model: lerobot/smolvla_base

> Holo-1: 3B & 7B web/computer-use agentic VLAs by H Company 💻
Model family: Hcompany/holo1-683dd1eece7eb077b96d0cbd
Demo: https://huggingface.co/spaces/multimodalart/Holo1
Blog: https://huggingface.co/blog/Hcompany/holo1
super exciting times!!