---
license: apache-2.0
tags:
- lerobot
- robotics
- vision-language-model
---

# Infatoshi/smolvla

This repository contains a `smolvla_base` policy trained with the [`lerobot`](https://github.com/huggingface/lerobot) framework.

## Model Description

This model is a Vision-Language-Action (VLA) policy that takes visual observations, proprioceptive states, and a language instruction, and predicts robot actions.

- **Policy Type:** `smolvla`
- **Dataset:** `gribok201/smolvla_koch4`
- **VLM Backbone:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- **Trained Steps:** `10000`

### I/O Schema

**Input Features:**

- `observation.image`: type `VISUAL`, shape `[3, 256, 256]`
- `observation.image2`: type `VISUAL`, shape `[3, 256, 256]`
- `observation.image3`: type `VISUAL`, shape `[3, 256, 256]`
- `observation.state`: type `STATE`, shape `[6]`

**Output Features:**

- `action`: type `ACTION`, shape `[6]`

**Image Preprocessing:** Images are expected to be resized (with padding) to `[512, 512]` before being passed to the model.

## How to Use

This model can be loaded with `transformers.AutoModel` using `trust_remote_code=True`.

**You MUST have `lerobot` installed in your environment for this to work** (`pip install lerobot`).

```python
from transformers import AutoModel
import torch

# Replace with your model's repo_id
repo_id = "Infatoshi/smolvla"

# Load the model - CRITICAL: trust_remote_code=True
# This executes the custom code in modeling_lerobot_policy.py
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
print("Model loaded successfully!")

# Example inference:
# Create dummy inputs matching the model's expected schema.
resize_shape = tuple(model.config.resize_imgs_with_padding)
state_shape = tuple(model.config.input_features["observation.state"]["shape"])

# Dummy observations dictionary.
# The image keys ("usb", "brio") must match the camera names your model expects;
# this checkpoint's schema lists three visual inputs, so adjust the number of
# images and their keys to match your setup.
dummy_observations = {
    "state": torch.randn(1, *state_shape),
    "images": {
        "usb": torch.randn(1, 3, *resize_shape),
        "brio": torch.randn(1, 3, *resize_shape),
    },
}

dummy_language_instruction = "pick up the cube"

with torch.no_grad():
    output = model(
        observations=dummy_observations,
        language_instruction=dummy_language_instruction,
    )

print("Inference output (predicted actions):", output)
print("Output shape:", output.shape)
```
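
If you want to feed real camera frames instead of random tensors, the helper below is a minimal sketch of how you might turn an image file into the `[1, 3, 512, 512]` tensor format described above. The pad-to-square step and the file names (`frame_usb.png`, `frame_brio.png`) are assumptions for illustration; the repository's custom code may already handle resizing internally, so check `modeling_lerobot_policy.py` for the exact preprocessing.

```python
import torch
import torch.nn.functional as F
from PIL import Image
import torchvision.transforms.functional as TF

def prepare_image(path: str, target_hw=(512, 512)) -> torch.Tensor:
    """Load an image, pad it to a square, resize, and add a batch dimension."""
    img = Image.open(path).convert("RGB")
    tensor = TF.to_tensor(img)                   # [3, H, W], float in [0, 1]
    _, h, w = tensor.shape
    side = max(h, w)
    # Pad the right/bottom edges to a square so the aspect ratio is preserved.
    tensor = F.pad(tensor, (0, side - w, 0, side - h), value=0.0)
    tensor = TF.resize(tensor, list(target_hw))  # [3, 512, 512]
    return tensor.unsqueeze(0)                   # [1, 3, 512, 512]

# Hypothetical observation dict built from real frames and a real robot state.
observations = {
    "state": torch.zeros(1, 6),  # replace with your robot's proprioceptive state
    "images": {
        "usb": prepare_image("frame_usb.png"),
        "brio": prepare_image("frame_brio.png"),
    },
}
```

You can then pass `observations` to the model in the same way as `dummy_observations` in the example above.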