|
---
license: apache-2.0
tags:
- lerobot
- robotics
- vision-language-model
---
|
|
|
# Infatoshi/smolvla |
|
|
|
This repository contains a `smolvla_base` policy trained with the [`lerobot`](https://github.com/huggingface/lerobot) framework. |
|
|
|
## Model Description |
|
|
|
This model is a Vision-Language-Action (VLA) policy: it takes visual observations, proprioceptive state, and a language instruction as input and predicts robot actions.
|
|
|
- **Policy Type:** `smolvla` |
|
- **Dataset:** `gribok201/smolvla_koch4` (see the loading snippet below)
|
- **VLM Backbone:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
|
- **Trained Steps:** `10000` |
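
If you want to inspect the training data, the dataset can be loaded directly with `lerobot`. A minimal sketch, assuming a recent `lerobot` release (older versions expose the class under `lerobot.common.datasets.lerobot_dataset`):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Download/stream the dataset from the Hugging Face Hub.
dataset = LeRobotDataset("gribok201/smolvla_koch4")

print(len(dataset))       # total number of frames
print(dataset[0].keys())  # observation/action keys for a single frame
```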
|
|
|
### I/O Schema |
|
|
|
**Input Features:** |
|
- `observation.image`: type `VISUAL`, shape `[3, 256, 256]` |
|
- `observation.image2`: type `VISUAL`, shape `[3, 256, 256]` |
|
- `observation.image3`: type `VISUAL`, shape `[3, 256, 256]` |
|
- `observation.state`: type `STATE`, shape `[6]` |
|
|
|
**Output Features:** |
|
- `action`: type `ACTION`, shape `[6]` |
|
|
|
**Image Preprocessing:** |
|
Images are expected to be resized, with aspect-ratio-preserving padding, to `[512, 512]` before being passed to the model (this is the `resize_imgs_with_padding` value read from the config in the example below).
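
A minimal sketch of that preprocessing, assuming corner-aligned zero padding; verify the exact convention (corner vs. centered padding, normalization) against what `lerobot` applied at training time:

```python
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def resize_with_padding(img: Image.Image, target: int = 512) -> torch.Tensor:
    # Scale the longer side down/up to `target`, preserving aspect ratio.
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    tensor = TF.to_tensor(img.resize((new_w, new_h)))  # [3, new_h, new_w], values in [0, 1]
    # Zero-pad the right/bottom edges up to [3, target, target].
    return TF.pad(tensor, [0, 0, target - new_w, target - new_h])

frame = resize_with_padding(Image.open("frame.png").convert("RGB"))  # any saved camera frame
print(frame.shape)  # torch.Size([3, 512, 512])
```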
|
|
|
## How to Use |
|
|
|
This model can be loaded using `transformers.AutoModel` with `trust_remote_code=True`. |
|
**You MUST have `lerobot` installed in your environment for this to work** (`pip install lerobot`).
|
|
|
```python
from transformers import AutoModel
import torch

# Replace with your model's repo_id
repo_id = "Infatoshi/smolvla"

# Load the model. CRITICAL: trust_remote_code=True, which executes the
# custom wrapper code in modeling_lerobot_policy.py.
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("Model loaded successfully!")

# Example inference: create dummy inputs matching the model's expected schema.
resize_shape = tuple(model.config.resize_imgs_with_padding)
state_shape = tuple(model.config.input_features["observation.state"]["shape"])

# Dummy observations dictionary: one image tensor per camera in the I/O
# schema above. Adjust the keys if your cameras were named differently
# at training time.
dummy_observations = {
    "state": torch.randn(1, *state_shape),
    "images": {
        "image": torch.randn(1, 3, *resize_shape),
        "image2": torch.randn(1, 3, *resize_shape),
        "image3": torch.randn(1, 3, *resize_shape),
    },
}
dummy_language_instruction = "pick up the cube"

with torch.no_grad():
    output = model(
        observations=dummy_observations,
        language_instruction=dummy_language_instruction,
    )

print("Inference output (predicted actions):", output)
print("Output shape:", output.shape)
```
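
For deployment, the same call can run inside a control loop. A hypothetical sketch follows; `get_camera_frames`, `read_joint_state`, and `send_action` are placeholder stubs standing in for your real hardware interface:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder I/O stubs; replace with your robot's camera and motor drivers.
def get_camera_frames() -> dict[str, torch.Tensor]:
    # One [3, 512, 512] tensor per camera in the I/O schema.
    return {k: torch.rand(3, 512, 512) for k in ("image", "image2", "image3")}

def read_joint_state() -> torch.Tensor:
    return torch.rand(6)  # matches observation.state shape [6]

def send_action(action: torch.Tensor) -> None:
    print("action:", action)

for _ in range(100):  # one predicted action per control tick
    observations = {
        "state": read_joint_state().unsqueeze(0).to(device),
        "images": {
            k: v.unsqueeze(0).to(device) for k, v in get_camera_frames().items()
        },
    }
    with torch.no_grad():
        action = model(
            observations=observations,
            language_instruction="pick up the cube",
        )
    send_action(action.squeeze(0).cpu())
```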