SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data

Published June 3, 2025

🧭 TL;DR

Today, we introduce SmolVLA, a compact (450M-parameter), open-source Vision-Language-Action (VLA) model for robotics that runs on consumer hardware.

  • Pretrained only on compatibly licensed, open-source community-shared datasets under the lerobot tag.
  • SmolVLA-450M outperforms much larger VLAs and strong baselines such as ACT on simulation (LIBERO, Meta-World) and real-world tasks (SO100, SO101).
  • Supports asynchronous inference for 30% faster response and 2× task throughput.

Introduction

Over the past few years, Transformers have driven remarkable progress in AI, from language models capable of human-like reasoning to multimodal systems that understand both images and text. However, in real-world robotics, advancements have been much slower. Robots still struggle to generalize across diverse objects, environments, and tasks. This limited progress stems from a lack of high-quality, diverse data and the absence of models that can reason and act like humans in the physical world.

In response to these challenges, the field has recently turned to vision-language-action (VLA) models, which aim to unify perception, language understanding, and action prediction within a single architecture. VLAs typically take as input raw visual observations and natural language instructions, and output corresponding robot actions. While promising, much of the recent progress in VLAs remains locked behind proprietary models trained on large-scale private datasets, often requiring costly hardware setups and extensive engineering resources. As a result, the broader robotics research community faces significant barriers in reproducing and building upon these models.

SmolVLA addresses this gap by offering an open-source, compact, and efficient VLA model that can be trained on consumer-grade hardware using only publicly available datasets. By releasing not only model weights but also using very affordable open-source hardware, SmolVLA aims to democratize access to vision-language-action models and accelerate research toward generalist robotic agents.

Figure 1: Comparison of SmolVLA across task variations. From left to right: (1) asynchronous pick-place cube counting, (2) synchronous pick-place cube counting, (3) pick-place cube counting under perturbations, and (4) generalization on pick-and-place of the lego block with real-world SO101.

Meet SmolVLA!

SmolVLA-450M is our open-source, compact yet capable VLA model. It is:

  • Small enough to run on CPU, train on a single consumer GPU, or even a MacBook!
  • Trained on public, community-shared robotics data
  • Released with full training and inference recipes
  • Deployable and testable on very affordable hardware (SO-100, SO-101, LeKiwi, etc.)

Inspired by the training paradigms of Large Language Models (LLMs), SmolVLA goes through a pretraining phase on general manipulation data, followed by task-specific post-training. Architecturally, it combines Transformers with flow-matching decoders, and is optimized for speed and low-latency inference with the following design choices:

  • Skipping half of the layers of the VLM for faster inference and a smaller model
  • Interleaving self-attention and cross-attention blocks
  • Using fewer visual tokens
  • Leveraging smaller pretrained VLMs

Despite using fewer than 30k training episodes—an order of magnitude less than other VLAs—SmolVLA matches or exceeds the performance of much larger models, both in simulation and the real world.

To make real-time robotics easier to use, we introduce an asynchronous inference stack. It decouples how the robot executes actions from how it processes observations and predicts new actions. Thanks to this separation, robots can respond more quickly in fast-changing environments.

Figure 2. SmolVLA takes as input a sequence of RGB images from multiple cameras, the robot’s current sensorimotor state, and a natural language instruction. The VLM encodes these into contextual features, which condition the action expert to generate a continuous sequence of actions.

🚀 How to Use SmolVLA?

SmolVLA is designed to be easy to use and integrate—whether you're finetuning on your own data or plugging it into an existing robotics stack.

Install

First, install the required dependencies:

git clone https://github.com/huggingface/lerobot.git
cd lerobot
pip install -e ".[smolvla]"

Finetune the pretrained model

Use smolvla_base, our pretrained 450M model, with the lerobot training framework:

python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=200000

Train from scratch

If you'd like to build from the architecture (pretrained VLM + action expert) rather than a pretrained checkpoint:

python lerobot/scripts/train.py \
  --policy.type=smolvla \
  --dataset.repo_id=lerobot/svla_so100_stacking \
  --batch_size=64 \
  --steps=200000

You can also load SmolVLAPolicy directly:

# Import path assumes the current lerobot layout; it may differ across versions.
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("lerobot/smolvla_base")

Method

SmolVLA is not only a lightweight yet capable model, but also a method for training and evaluating generalist robotics policies. In this section, we introduce the model architecture behind SmolVLA and the asynchronous inference setup used for evaluation, which has proven to be more adaptable and capable of faster recovery.

SmolVLA consists of two core components: a Vision-Language Model (VLM) that processes multimodal inputs and an action expert that outputs robot control commands. Below, we share the details of the main components of SmolVLA architecture and the Asynchronous Inference. More details can be found in our technical report.

Main Architecture

Vision-Language Model (VLM)

We use SmolVLM2 as our VLM backbone. It’s optimized for multi-image inputs and consists of a SigLIP vision encoder and a SmolLM2 language decoder.

  • Image tokens are extracted via the vision encoder.
  • Language instructions are tokenized and fed directly into the decoder.
  • Sensorimotor states are projected into a single token using a linear layer to align with the token dimension of the language model.

The decoder layers process concatenated image, language, and state tokens. The resulting features are then passed to the action expert.
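To make the token layout concrete, here is a minimal PyTorch sketch of how the prefix could be assembled. The names (e.g., state_proj) and dimensions are illustrative assumptions, not SmolVLA's actual code.

import torch
import torch.nn as nn

hidden_dim = 960   # assumed token dimension of the language decoder
state_dim = 6      # e.g., joint positions of an SO-100 arm

state_proj = nn.Linear(state_dim, hidden_dim)   # projects the state into a single token

image_tokens = torch.randn(1, 64, hidden_dim)   # 64 visual tokens per frame (see below)
text_tokens = torch.randn(1, 12, hidden_dim)    # tokenized language instruction
state = torch.randn(1, state_dim)

state_token = state_proj(state).unsqueeze(1)    # (1, 1, hidden_dim)
prefix = torch.cat([image_tokens, text_tokens, state_token], dim=1)
# The VLM decoder layers process this prefix; its features condition the action expert.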

Action Expert: Flow Matching Transformer

SmolVLA’s action expert is a compact transformer (~100M parameters) that generates action chunks, i.e., sequences of future robot actions, conditioned on the VLM’s outputs. It is trained with a flow matching objective, which teaches the model to guide noisy samples back to the ground truth. Discrete action representations (e.g., via tokenization) are powerful but typically require autoregressive decoding, which is slow and inefficient at inference time. Flow matching instead allows direct, non-autoregressive prediction of continuous actions, enabling real-time control with high precision.

More intuitively, during training, we add random noise to the robot’s real action sequences and ask the model to predict the “correction vector” that brings them back to the correct trajectory. This forms a smooth vector field over the action space, helping the model learn accurate and stable control policies.

We implement this using a transformer architecture with interleaved attention blocks (see Figure 2), and reduce its hidden size to 75% of the VLM’s, keeping the model lightweight for deployment.
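For intuition, here is a simplified flow-matching training step for an action expert. The linear interpolation path and the call signature of action_expert are illustrative assumptions; SmolVLA's actual objective may differ in details such as the timestep distribution.

import torch
import torch.nn.functional as F

def flow_matching_loss(action_expert, actions, vlm_features):
    # actions: (batch, chunk_len, action_dim) ground-truth action chunk
    noise = torch.randn_like(actions)
    t = torch.rand(actions.shape[0], 1, 1, device=actions.device)  # one timestep per sample

    # Interpolate between pure noise (t=0) and the real actions (t=1).
    noisy_actions = (1 - t) * noise + t * actions
    target_velocity = actions - noise   # the "correction vector" pointing back to the data

    pred_velocity = action_expert(noisy_actions, t, vlm_features)
    return F.mse_loss(pred_velocity, target_velocity)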

Design Choices for Efficiency and Robustness

While combining a vision-language model with an action prediction module is a common design pattern in recent VLA systems, such as Pi0, GR00T, and Diffusion Policy, we identified several architectural choices that significantly enhance robustness and performance. In SmolVLA, we apply three key techniques: reducing the number of visual tokens, skipping the upper layers of the VLM, and interleaving cross- and self-attention layers in the action expert.

Visual Token Reduction

High-resolution images improve perception but can significantly slow down inference. To strike a balance, SmolVLA limits the number of visual tokens to 64 per frame during both training and inference. For example, a 512×512 image is compressed into just 64 tokens, instead of 1024, using a pixel-shuffle operation that folds spatial patches into the channel dimension. While the underlying Vision-Language Model (VLM) was originally pretrained with image tiling for broader coverage, SmolVLA uses only the global image at runtime to keep inference lightweight and fast.
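The sketch below shows one way such a pixel-shuffle reduction can be implemented: a 32×32 grid of patch features (1024 tokens, e.g. a 512×512 image with 16×16 patches) is folded into an 8×8 grid of 64 tokens by stacking each 4×4 neighborhood along the channel dimension. The factor and the follow-up projection are assumptions and may differ from SmolVLM2's exact implementation.

import torch

def pixel_shuffle_tokens(tokens, grid=32, factor=4):
    # tokens: (batch, grid*grid, channels) patch features from the vision encoder
    b, n, c = tokens.shape
    x = tokens.view(b, grid, grid, c)
    x = x.view(b, grid // factor, factor, grid // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (grid // factor) ** 2, factor * factor * c)
    return x   # (batch, 64, channels * 16); a linear layer would map this back to the VLM width

vis = torch.randn(1, 1024, 768)
print(pixel_shuffle_tokens(vis).shape)   # torch.Size([1, 64, 12288])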

Faster Inference via Layer Skipping

Rather than always relying on the final layer of the VLM, which can be expensive and sometimes suboptimal, we use features from intermediate layers. Prior work has shown that earlier layers often provide better representations for downstream tasks. In SmolVLA, the action expert attends only to VLM features up to a configurable layer N, set to half the total number of layers during training. This halves the compute cost of both the VLM and the action expert, significantly speeding up inference with minimal performance loss.
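As an illustration, the toy module below runs only the first N = L/2 decoder blocks and returns those features for the action expert. The stand-in blocks and names are hypothetical; in practice the truncated forward pass is configured inside the lerobot policy.

import torch
import torch.nn as nn

class TruncatedDecoder(nn.Module):
    def __init__(self, layers: nn.ModuleList, keep_ratio: float = 0.5):
        super().__init__()
        n_keep = int(len(layers) * keep_ratio)
        self.layers = layers[:n_keep]   # skip the upper half of the stack

    def forward(self, hidden_states):
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states            # intermediate features the action expert attends to

# Toy usage with stand-in transformer blocks:
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=960, nhead=8, batch_first=True) for _ in range(24)
)
decoder = TruncatedDecoder(blocks)      # runs only 12 of the 24 layers
features = decoder(torch.randn(1, 77, 960))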

Interleaved Cross and Self-Attention

Inside the action expert, attention layers alternate between:

  • Cross-attention (CA), where action tokens attend to the VLM’s features
  • Self-attention (SA), where action tokens attend to each other (causally—only to the past)

We found that this interleaved design is both lighter and more effective than using full attention blocks. Models that rely only on CA or only on SA tend to sacrifice either smoothness or grounding.

In SmolVLA, CA ensures that actions are well-conditioned on perception and instructions, while SA improves temporal smoothness—especially critical for real-world control, where jittery predictions can result in unsafe or unstable behavior.
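The sketch below captures the alternating pattern: even-indexed blocks cross-attend to the VLM features, odd-indexed blocks self-attend causally over the action tokens. The dimensions, the projection layer, and the missing norms/MLPs are simplifications, not SmolVLA's exact block design.

import torch
import torch.nn as nn

class InterleavedActionExpert(nn.Module):
    def __init__(self, dim=720, vlm_dim=960, n_layers=8, n_heads=8):
        super().__init__()
        self.kv_proj = nn.Linear(vlm_dim, dim)   # match the expert's smaller hidden size
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)
        )

    def forward(self, action_tokens, vlm_features):
        kv = self.kv_proj(vlm_features)
        T = action_tokens.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        x = action_tokens
        for i, attn in enumerate(self.blocks):
            if i % 2 == 0:   # cross-attention: action tokens attend to VLM features
                out, _ = attn(x, kv, kv)
            else:            # causal self-attention: action tokens attend only to the past
                out, _ = attn(x, x, x, attn_mask=causal)
            x = x + out      # residual connection; real blocks also include norms and MLPs
        return x

expert = InterleavedActionExpert()
actions = expert(torch.randn(1, 50, 720), torch.randn(1, 77, 960))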

Asynchronous Inference


Figure 3. Asynchronous inference. Illustration of the asynchronous inference stack. Note that the policy can be run on a remote server, possibly with GPUs.

Modern visuomotor policies output action chunks—sequences of actions to execute. There are two ways to manage them:

  • Synchronous (sync): The robot executes a chunk, then pauses while the next one is computed. Simple, but causes a delay where the robot can't react to new inputs.
  • Asynchronous (async): While executing the current chunk, the robot already sends the latest observation to a Policy Server (possibly hosted on GPU) for the next chunk. This avoids idle time and improves reactivity.

Our async stack decouples action execution from chunk prediction, resulting in higher adaptability and no execution lag at runtime. It relies on the following key mechanisms (a minimal sketch follows the list):

  • 1. Early trigger: When the queue length falls below a threshold (e.g., 70%), we send the latest observation to the Policy Server, requesting a new action chunk.
  • 2. Decoupled threads: Control loop keeps executing → inference happens in parallel (non-blocking).
  • 3. Chunk fusion: Overlapping actions from successive chunks are stitched with a simple merge rule to avoid jitter.
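Below is a conceptual sketch of this loop, not the actual lerobot client/server API: a control thread drains an action queue while a background thread asks a stubbed policy server for the next chunk whenever the queue runs low. All helpers here are hypothetical stand-ins.

import queue
import threading
import time

CHUNK_SIZE = 50
TRIGGER_RATIO = 0.7                  # early trigger: request a new chunk below 70% fill

action_queue = queue.Queue()
_inflight = threading.Lock()         # at most one pending request to the policy server

def policy_server_predict(observation):
    """Stub for the remote policy server returning an action chunk."""
    time.sleep(0.2)                  # pretend inference latency
    return [observation + i for i in range(CHUNK_SIZE)]

def request_chunk_async(observation):
    if not _inflight.acquire(blocking=False):
        return                       # a request is already in flight
    def _worker():
        try:
            for action in policy_server_predict(observation):
                action_queue.put(action)   # the real stack fuses overlapping chunks here
        finally:
            _inflight.release()
    threading.Thread(target=_worker, daemon=True).start()

def control_loop(steps=200):
    request_chunk_async(observation=0.0)
    for step in range(steps):
        if action_queue.qsize() < TRIGGER_RATIO * CHUNK_SIZE:
            request_chunk_async(observation=float(step))   # non-blocking
        action = action_queue.get()   # execute while the next chunk is being computed
        time.sleep(0.01)              # stand-in for sending the action to the robot

control_loop()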

We are really excited about releasing asynchronous inference because it provides greater adaptability and improved performance without changing the model. In short, async inference keeps the robot responsive by overlapping execution and remote prediction.

Community Datasets

While vision and language models thrive on web-scale datasets like LAION, ImageNet, and Common Crawl, robotics lacks a comparable resource. There’s no “Internet of robots.” Instead, data is fragmented across robot types, sensors, control schemes, and formats—forming disconnected "data islands". In our previous post, we explored how this fragmentation could be resolved through open, collaborative efforts. Just as ImageNet catalyzed breakthroughs in computer vision by providing a large, diverse benchmark, we believe that community-driven robotics datasets can play the same foundational role for generalist robot policies.

SmolVLA is our first step toward that vision: It is pretrained on a curated mix of publicly available, community-contributed datasets designed to reflect real-world variation. Rather than optimizing for dataset size alone, we focus on diversity: a range of behaviors, camera viewpoints, and embodiments that promote transfer and generalization.

All training data used in SmolVLA comes from LeRobot Community Datasets, robotics datasets shared on the Hugging Face Hub under the lerobot tag. Collected in diverse settings, from labs to living rooms, these datasets represent an open, decentralized effort to scale real-world robot data.

Figure 4. A glimpse of the community dataset. Special thanks to Ville Kuosmanen for creating the visualization. Unlike academic benchmarks, community datasets naturally capture messy, realistic interactions: varied lighting, suboptimal demonstrations, unconventional objects, and heterogeneous control schemes. This kind of diversity will be very useful for learning robust, general-purpose representations.

We used a custom filtering tool created by Alexandre Chapin and Ville Kuosmanen to select datasets based on frame count, visual quality, and task coverage. After a meticulous manual review (special thanks to Marina Barannikov), we curated a collection of 487 high-quality datasets focused on the SO100 robotic arm, standardized at 30 FPS. This yielded around 10 million frames, at least one order of magnitude smaller than other popular benchmark datasets, yet significantly more diverse.

Improving Task Annotations

A common issue across community datasets was noisy or missing task descriptions. Many episodes lacked annotations or included vague labels such as “task desc”, “Move”, or “Pick”. To improve quality and standardize the textual input across datasets, we used Qwen2.5-VL-3B-Instruct to generate concise, action-oriented descriptions.

Given sample frames and the original label, the model was prompted to rewrite the instruction in under 30 characters, starting with an action verb (e.g., “Pick,” “Place,” “Open”).

The prompt used is:

Here is a current task description: {current_task}. Generate a very short, clear, and complete one-sentence describing the action performed by the robot arm (max 30 characters). Do not include unnecessary words.
Be concise.
Here are some examples: Pick up the cube and place it in the box, open the drawer and so on.
Start directly with an action verb like “Pick”, “Place”, “Open”, etc.
Similar to the provided examples, what is the main action done by the robot arm?
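A hedged sketch of this annotation-cleanup step is shown below, assuming Qwen2.5-VL is queried through Hugging Face transformers with a sampled frame passed as a PIL image; the exact prompting and preprocessing used for SmolVLA may differ.

from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
# device_map="auto" requires the accelerate package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def rewrite_task(frame, current_task):
    """Ask the VLM for a short, action-oriented instruction (<= 30 characters)."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": (
                f"Here is a current task description: {current_task}. "
                "Generate a very short, clear, and complete one-sentence describing the action "
                "performed by the robot arm (max 30 characters). "
                "Start directly with an action verb like 'Pick', 'Place', 'Open'."
            )},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[frame], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()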

Standardizing Camera Views

Another challenge was inconsistent camera naming. Some datasets used clear names like top or wrist.right, while others used ambiguous labels like images.laptop, which varied in meaning. To fix this, we manually went through the datasets and mapped each camera view to a standardized scheme:

  • OBS_IMAGE_1: Top-down view
  • OBS_IMAGE_2: Wrist-mounted view
  • OBS_IMAGE_3+: Additional viewpoints
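For illustration, this mapping can be expressed as a simple lookup table; the keys below are made-up examples of raw dataset feature names, and the real table was built by manual review.

# Hypothetical raw keys; the actual community datasets use many more variants.
CAMERA_KEY_MAP = {
    "observation.images.top": "OBS_IMAGE_1",           # top-down view
    "observation.images.wrist.right": "OBS_IMAGE_2",   # wrist-mounted view
    "observation.images.laptop": "OBS_IMAGE_3",        # additional viewpoint
}

def standardize_camera_key(raw_key):
    # Unknown or extra cameras fall back to the "additional viewpoints" slot.
    return CAMERA_KEY_MAP.get(raw_key, "OBS_IMAGE_3")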

We further isolate the contributions of community dataset pretraining and multitask finetuning. Without pretraining on the LeRobot community datasets, SmolVLA achieves 51.7% success on SO100. After pretraining on community-collected data, performance jumps to 78.3%, an absolute improvement of +26.6%. Multitask finetuning further boosts performance, showing strong task transfer capabilities even in low-data regimes.

Table 1. Impact of Pretraining on Community Datasets and Multitask Finetuning.

Results

We evaluate SmolVLA across simulation and real-world benchmarks to test its generalization, efficiency, and robustness. Despite being compact, it consistently matches or outperforms significantly larger models and policies pretrained on much larger robotics datasets.


Table 2. SmolVLA Performance on Simulation Benchmarks.


Table 3. SmolVLA vs Baselines on Real-World Tasks (SO100).

In real-world settings, SmolVLA is evaluated on two diverse suites: SO100 and SO101. These tasks include pick-place, stacking, and sorting, with both in-distribution and out-of-distribution object configurations. On SO101, SmolVLA also excels in generalization:


Table 4. Generalization of SmolVLA to a New Embodiment (SO101) vs. ACT.

Finally, we evaluate SmolVLA under synchronous and asynchronous inference modes. Async inference decouples action execution from model inference, allowing the policy to react while the robot is moving.

  • Both modes achieve similar task success (≈78%), but async inference:
    • Completes tasks ~30% faster (9.7s vs. 13.75s)
    • Enables 2× more completions in fixed-time settings (19 vs. 9 cubes)

This results in more responsive and robust real-world performance, especially in dynamic environments with shifting objects or external disturbances.


Figure 5. Asynchronous vs. Synchronous Inference in Real-World Tasks. (a) Task success rates (%), (b) average completion time (s), and (c) number of tasks completed within a fixed time window.

Conclusion

SmolVLA is our contribution to building robotics foundation models that are open, efficient, and reproducible. Despite its small size, it matches or outperforms larger, proprietary models across a range of real-world and simulated tasks. By relying solely on community-contributed datasets and affordable hardware, SmolVLA lowers the barrier to entry for researchers, educators, and hobbyists alike. But this is just the beginning. SmolVLA is more than just a model — it's part of a growing open-source movement toward scalable, collaborative robotics.

Call to Action:

  • 🔧 Try it out! Finetune SmolVLA on your own data, deploy it on affordable hardware, or benchmark it against your current stack, and share your results on Twitter/LinkedIn.
  • 🤖 Upload your dataset! Got a robot? Collect and share your data in the lerobot format. Help expand the community dataset that powers SmolVLA.
  • 💬 Join the blog discussion. Drop your questions, ideas, or feedback in the discussion below. We’re happy to help with integration, training, or deployment.
  • 📊 Contribute. Improve datasets, report issues, suggest new ideas. Every contribution helps.
  • 🌍 Spread the word. Share SmolVLA with fellow researchers, developers, or educators interested in efficient, real-time robotic policies.
  • 📫 Stay in touch: Follow the LeRobot organization and join the Discord server for updates, tutorials, and new releases.

Together, we can make real-world robotics more capable, more affordable, and more open. ✨
