smolagents (smolagents)

merve

posted an update about 19 hours ago

Post

502

past week had huuuge releases 💗
here are our picks 🔥 find more models, datasets, and demos here: merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new sota LLM with 1T total and 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size and offers a thinking mode 💭, released alongside the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA

albertvillanova

posted an update 4 days ago

Post

266

🚀 New in smolagents v1.20.0: Remote Python Execution via WebAssembly (Wasm)

We've just merged a major new capability into the smolagents framework: the CodeAgent can now execute Python code remotely in a secure, sandboxed WebAssembly environment!

🔧 Powered by Pyodide and Deno, this new WasmExecutor lets your agent-generated Python code run safely, without relying on Docker or local execution.

Why this matters:
✅ Isolated execution = no host access
✅ No need for Python on the user's machine
✅ Safer evaluation of arbitrary code
✅ Compatible with serverless / edge agent workloads
✅ Ideal for constrained or untrusted environments
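
A minimal sketch of how an agent might be pointed at the Wasm sandbox. It assumes that `CodeAgent` accepts `executor_type="wasm"` to select the WasmExecutor, that Deno is installed on the machine running the agent, and that the agent can be used as a context manager for cleanup; `run_in_wasm_sandbox` is a hypothetical helper name, so check the v1.20.0 release notes for the exact options.

```python
def run_in_wasm_sandbox(model, task: str):
    """Run `task` with a CodeAgent whose generated code executes in the Wasm sandbox.

    `model` is any smolagents model object supplied by the caller.
    """
    # Imported lazily so this helper can be defined without smolagents installed.
    from smolagents import CodeAgent

    # Assumption: executor_type="wasm" selects the WasmExecutor (Pyodide + Deno).
    # The `with` block lets the agent release the remote executor when done.
    with CodeAgent(tools=[], model=model, executor_type="wasm") as agent:
        return agent.run(task)


# Hypothetical usage (requires smolagents, Deno, and model credentials):
#   from smolagents import InferenceClientModel
#   print(run_in_wasm_sandbox(InferenceClientModel(), "Compute the 10th Fibonacci number"))
```

Taking the model as a parameter keeps the sketch independent of any particular inference provider.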

This is just the beginning: a focused initial implementation with known limitations. A solid MVP designed for secure, sandboxed use cases. 💡

💡 We're inviting the open-source community to help evolve this executor:
• Tackle more advanced Python features
• Expand compatibility
• Add test coverage
• Shape the next-gen secure agent runtime

🔗 Check out the PR: https://github.com/huggingface/smolagents/pull/1261

Let's reimagine what agent-driven Python execution can look like: remote-first, wasm-secure, and community-built.

This feature is live in smolagents v1.20.0!
Try it out.
Break things. Extend it. Give us feedback.
Let's build safer, smarter agents, together 🧠⚙️

👉 https://github.com/huggingface/smolagents/releases/tag/v1.20.0

#smolagents #WebAssembly #Python #AIagents #Pyodide #Deno #OpenSource #HuggingFace #AgenticAI

merve

posted an update 6 days ago

Post

3000

GitHub has refused to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision

merve

posted an update 7 days ago

Post

3355

ByteDance released Tar 1.5B and 7B: image-text in image-text out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion).
The model itself is a full LLM (Qwen2); the tokenizer converts images into tokens 🤯

merve

posted an update 8 days ago

Post

3616

Huge drops in open AI this past week!
Find more models, datasets, demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫡
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and French
👀 aharley/alltracker is a new pixel tracking model by Stanford, demo is here aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset

merve

posted an update 13 days ago

Post

908

SOOOO MANY MODEL RELEASES 😍
Here are some picks from the past week 🤗

> ByteDance/XVerse is a new identity preserving image generation model 🖼️
> google/gemma-3n-E4B-it, any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1 and nvidia/llama-nemoretriever-colembed-1b-v1 are two new state-of-the-art visual document retrievers 📑
> New version of Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c

merve

posted an update 13 days ago

Post

2990

visual reasoning is now in transformers 🔥
THUDM/GLM-4.1V-9B-Thinking was just released and merged into transformers, we gave it a vibe test run 🤠

it's very good, comes with 64k context length and MIT license 😍
it supports 4k image tokens and any aspect ratio as well!
Notebook: http://colab.research.google.com/drive/1atODIiV57hOZLv16Bjzwd6fwx0yoTorj?usp=sharing
Demo: THUDM/GLM-4.1V-9B-Thinking-Demo

merve

posted an update 15 days ago

Post

2513

so many multimodal releases these days 🤠
> ERNIE-4.5-VL: new vision language MoE models by Baidu https://huggingface.co/models?search=ernie-4.5-vl
> new visual document retrievers by NVIDIA (sota on ViDoRe!) nvidia/llama-nemoretriever-colembed-3b-v1 nvidia/llama-nemoretriever-colembed-1b-v1
> Ovis-3b: new image-text in image-text out models by Alibaba ⤵️ https://huggingface.co/spaces/AIDC-AI/Ovis-U1-

thomwolf

authored a paper 18 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 19 days ago • 61

lvwerra

authored a paper 18 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 19 days ago • 61

merve

posted an update 19 days ago

Post

593

Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview all the PDFs more easily than before!

on top of this, there's a PdfFolder format to load PDF datasets faster 💨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🤝
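
As a rough illustration of the layout described above, the sketch below builds a tiny PdfFolder-style directory with a `metadata.csv` next to the PDFs. The `file_name` column follows the convention used by the other folder-based loaders in `datasets` and is assumed to apply here too; the `label` column, the placeholder PDF bytes, and the helper name `make_pdf_folder` are purely illustrative.

```python
import csv
import tempfile
from pathlib import Path


def make_pdf_folder(root: Path, split: str, labels: dict[str, str]) -> Path:
    """Create root/<split>/<name>.pdf files plus a metadata.csv in the same folder."""
    split_dir = root / split
    split_dir.mkdir(parents=True, exist_ok=True)
    for name in labels:
        # Placeholder bytes only; a real dataset would copy actual PDF files here.
        (split_dir / name).write_bytes(b"%PDF-1.4\n%%EOF\n")
    metadata_path = split_dir / "metadata.csv"
    with metadata_path.open("w", newline="") as f:
        writer = csv.writer(f)
        # Assumed convention: `file_name` links each row to a PDF in this folder.
        writer.writerow(["file_name", "label"])
        for name, label in labels.items():
            writer.writerow([name, label])
    return metadata_path


root = Path(tempfile.mkdtemp()) / "folder"
metadata = make_pdf_folder(root, "train", {"doc1.pdf": "invoice", "doc2.pdf": "report"})
layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())
print(layout)

# Loading is left as a comment because it needs the `datasets` library and its PDF
# dependencies; the builder name "pdffolder" is an assumption to check in the docs:
#   from datasets import load_dataset
#   ds = load_dataset("pdffolder", data_dir=str(root))
```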

read document dataset docs https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here https://huggingface.co/datasets?modality=modality:document&sort=trending 📖

freddyaboulton

posted an update 20 days ago

Post

3391

The new multimodalart/self-forcing model and demo are truly impressive!

albertvillanova

posted an update 21 days ago

Post

1583

🚀 SmolAgents v1.19.0 is live!
This release brings major improvements to agent flexibility, UI usability, streaming architecture, and developer experience, making it easier than ever to build smart, interactive AI agents. Here's what's new:

🔧 Agent Upgrades
- Support for managed agents in ToolCallingAgent
- Context manager support for cleaner agent lifecycle handling
- Output formatting now uses XML tags for consistency

🖥️ UI Enhancements
- GradioUI now supports reset_agent_memory: perfect for fresh starts in dev & demos.

🔄 Streaming Refactor
- Streaming event aggregation moved off the Model class
- ➡️ Better architecture & maintainability

📦 Output Tracking
- CodeAgent outputs are now stored in ActionStep
- ✅ More visibility and structure to agent decisions

🐛 Bug Fixes
- Smarter planning logic
- Cleaner Docker logs
- Better prompt formatting for additional_args
- Safer internal functions and final answer matching

📚 Docs Improvements
- Added quickstart examples with tool usage
- One-click Colab launch buttons
- Expanded reference docs (AgentMemory, GradioUI docstrings)
- Fixed broken links and migrated to .md format

🔗 Full release notes:
https://github.com/huggingface/smolagents/releases/tag/v1.19.0

💬 Try it out, explore the new features, and let us know what you build!

#smolagents #opensource #AIagents #LLM #HuggingFace

merve

posted an update 21 days ago

Post

639

we've merged LightGlue keypoint matcher to Hugging Face transformers! it allows commercial use when paired with an open-source keypoint detector 🙏🏻

it works very well, try it yourself: ETH-CVG/LightGlue

here's an in-the-wild test with two images of the same place ⤵️

merve

posted an update 22 days ago

Post

4332

Release picks of the past week are here! Find more models, datasets, Spaces here merve/june-20-releases-68594824d1f4dfa61aee3433

🖼️ VLMs/OCR
> moonshotai/Kimi-VL-A3B-Thinking-2506 is a powerful reasoning vision LM with 3B active params, smarter with fewer tokens, supports long documents and videos 👏 (OS)
> nanonets/Nanonets-OCR-s is a 3.75B-parameter OCR model based on Qwen2.5VL-3B-Instruct (OS)

💬 LLMs
> moonshotai/Kimi-Dev-72B is a strong coding model based on Qwen2.5-72B (OS)
> Mistral released mistralai/Mistral-Small-3.2-24B-Instruct-2506, an update to their former model with better function calling & instruction following (OS)

🗣️ Audio
> Google released google/magenta-realtime, real time music generation & audio synthesis (cc-by-4)
> kyutai released new speech-to-text models that come in 1B & 2B (kyutai/stt-1b-en_fr, stt-2b-en_fr) with 0.5s and 2.5s delay

3D
> Tencent released tencent/Hunyuan3D-2.1 an image-to-3D model (see below)

merve

posted an update 23 days ago

Post

5033

fav open-source multimodal reasoning model just got an update 🔥

moonshotai/Kimi-VL-A3B-Thinking-2506 is
> smarter with fewer tokens, small in size (only 3B active params!!!)
> more accurate
> capable of video reasoning
> able to handle higher resolution 🤯
Read their blog https://huggingface.co/blog/moonshotai/kimi-vl-a3b-thinking-2506

merve

posted an update 25 days ago

Post

2297

y'all have been asking my opinion on how OCR models compare to each other 👀
I will leave three apps to compare newest models by @prithivMLmods instead ⤵️
> compare Nanonets-OCR-s, Qwen2-VL-OCR-2B-Instruct, RolmOCR, Aya-Vision prithivMLmods/Multimodal-OCR
> SmolDocling, Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B prithivMLmods/Multimodal-OCR2
> docscopeOCR, MonkeyOCR, coreOCR prithivMLmods/core-OCR

merve

posted an update 26 days ago

Post

1920

stop using VLMs blindly ✋🏻

compare different VLM outputs on a huge variety of inputs (from reasoning to OCR!) 🔥 visionLMsftw/comparevlms

> has support for multiple VLMs: google/gemma-3-27b-it, Qwen/Qwen2.5-VL-7B-Instruct, Qwen/Qwen2.5-VL-32B-Instruct, meta-llama/Llama-4-Maverick-17B-128E-Instruct, HuggingFaceTB/SmolVLM2-2.2B-Instruct
> recommend us new models or inputs, we'll add 🫡

so far I figured out
> for fact-checks, you need a relatively large model (7B is ok!)
> Gemma 3 degrades without pan and scan (especially for 📑)
> Qwen2.5VL-32B is very talkative, great for reasoning but not good for simple tasks 🗣️

merve

posted an update 27 days ago

Post

3625

Releases of the past week are here merve/releases-june-13-6852c3c1eaf1e0c24c958860

Here are our picks 🤓
So many interesting models were released in open AI this past week! 🤖

🖼️ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new sota video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)

Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, new large (137B 🤯) audio language model that takes in audio and generates audio (OS)

🤖 Robotics
> nvidia released nvidia/GR00T-N1.5-3B, new open foundation vision language action model

3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan by Tencent that can generate 3D assets from text and image prompts

merve

posted an update 28 days ago

Post

3559

IN: video fine-tuning support for facebook V-JEPA 2 in HF transformers 🔥

it comes with
> four models fine-tuned on Diving48 and SSv2 dataset facebook/v-jepa-2-6841bad8413014e185b497a6
> FastRTC demo on V-JEPA2 SSv2 qubvel-hf/vjepa2-streaming-video-classification
> fine-tuning script on UCF-101 https://gist.github.com/ariG23498/28bccc737c11d1692f6d0ad2a0d7cddb
> fine-tuning notebook on UCF-101 https://colab.research.google.com/drive/16NWUReXTJBRhsN3umqznX4yoZt2I7VGc?usp=sharing
we're looking forward to seeing what you will build! 🤗

smolagents

AI & ML interests

Recent Activity

Team members 10

smolagents's activity