---
license: apache-2.0
tags:
- lerobot
- robotics
- vision-language-model
---
# Infatoshi/smolvla
This repository contains a `smolvla_base` policy trained with the [`lerobot`](https://github.com/huggingface/lerobot) framework.
## Model Description
This model is a Vision-Language-Action (VLA) policy that takes visual observations, a proprioceptive state, and a language instruction as input and predicts robot actions.
- **Policy Type:** `smolvla`
- **Dataset:** `gribok201/smolvla_koch4`
- **VLM Backbone:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct`
- **Trained Steps:** `10000`
### I/O Schema
**Input Features:**
- `observation.image`: type `VISUAL`, shape `[3, 256, 256]`
- `observation.image2`: type `VISUAL`, shape `[3, 256, 256]`
- `observation.image3`: type `VISUAL`, shape `[3, 256, 256]`
- `observation.state`: type `STATE`, shape `[6]`
**Output Features:**
- `action`: type `ACTION`, shape `[6]`
**Image Preprocessing:**
Images are expected to be resized to `[512, 512]` before being passed to the model.
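The snippet below is a minimal, hedged sketch of that preprocessing step using standard `torchvision` transforms. It assumes a plain resize to `512x512` and a hypothetical `frame.png` input file; the policy's own `resize_imgs_with_padding` step may additionally pad to preserve the aspect ratio.

```python
# Hedged sketch: converting a raw camera frame into a [1, 3, 512, 512] float
# tensor in [0, 1]. A plain resize is used here; the policy's internal
# resize_imgs_with_padding may also pad to preserve the aspect ratio.
from PIL import Image
import torchvision.transforms as T

to_model_input = T.Compose([
    T.Resize((512, 512)),  # match the documented input resolution
    T.ToTensor(),          # HWC uint8 -> CHW float32 in [0, 1]
])

frame = Image.open("frame.png").convert("RGB")     # hypothetical file path
image_tensor = to_model_input(frame).unsqueeze(0)  # add batch dim -> [1, 3, 512, 512]
```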
## How to Use
This model can be loaded using `transformers.AutoModel` with `trust_remote_code=True`.
**You MUST have `lerobot` installed in your environment for this to work** (`pip install lerobot`).
```python
from transformers import AutoModel
import torch
# Replace with your model's repo_id
repo_id = "Infatoshi/smolvla"
# Load the model - CRITICAL: trust_remote_code=True
# This executes the custom code in modeling_lerobot_policy.py
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
print("Model loaded successfully!")
# Example Inference:
# Create dummy inputs matching the model's expected schema.
resize_shape = tuple(model.config.resize_imgs_with_padding)
state_shape = tuple(model.config.input_features["observation.state"]["shape"])
# Dummy observations dictionary.
# NOTE: the keys under "images" must match the camera names this checkpoint
# was trained with; adjust them to your own camera setup.
dummy_observations = {
    "state": torch.randn(1, *state_shape),
    "images": {
        "usb": torch.randn(1, 3, *resize_shape),
        "brio": torch.randn(1, 3, *resize_shape),
    },
}
dummy_language_instruction = "pick up the cube"
with torch.no_grad():
    output = model(
        observations=dummy_observations,
        language_instruction=dummy_language_instruction,
    )

print("Inference output (predicted actions):", output)
print("Output shape:", output.shape)
```
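If you are working inside the `lerobot` stack itself, the same checkpoint can also be loaded through lerobot's native policy API instead of the `AutoModel` wrapper. The sketch below is an assumption-laden example: the `SmolVLAPolicy` import path and the exact batch keys (here `task` for the language instruction and the `observation.*` keys from the schema above) vary across `lerobot` versions, so treat it as a starting point rather than a drop-in snippet.

```python
# Hedged sketch: loading the checkpoint with lerobot's native policy API.
# Assumptions: the `lerobot.common.policies` module layout (newer releases may
# expose `lerobot.policies.smolvla.modeling_smolvla` instead) and batch keys
# matching the I/O schema documented above.
import torch
from lerobot.common.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained("Infatoshi/smolvla")
policy.eval()
policy.reset()  # clear the internal action queue before a new episode

batch = {
    "observation.state": torch.randn(1, 6),
    "observation.image": torch.rand(1, 3, 256, 256),
    "observation.image2": torch.rand(1, 3, 256, 256),
    "observation.image3": torch.rand(1, 3, 256, 256),
    "task": "pick up the cube",  # some versions expect a list of strings
}

with torch.no_grad():
    action = policy.select_action(batch)  # expected shape: [1, 6]
print(action)
```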