|
---
license: apache-2.0
tags:
- lerobot
- robotics
- vision-language-model
---
|
|
|
# Infatoshi/smolvla |
|
|
|
This repository contains a `smolvla_base` policy trained with the [`lerobot`](https://github.com/huggingface/lerobot) framework. |
|
|
|
## Model Description |
|
|
|
This model is a Vision-Language-Action (VLA) policy: it takes visual observations, proprioceptive state, and a language instruction as input and predicts robot actions.
|
|
|
- **Policy Type:** `smolvla` |
|
- **Dataset:** `gribok201/smolvla_koch4` (see the loading snippet below)
|
- **VLM Backbone:** `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` |
|
- **Trained Steps:** `10000` |
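
If you want to inspect the training data, the dataset can be loaded directly with `lerobot`. A minimal sketch, assuming a recent `lerobot` release (older versions expose the class under `lerobot.common.datasets.lerobot_dataset`):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Download/stream the dataset from the Hugging Face Hub.
dataset = LeRobotDataset("gribok201/smolvla_koch4")

print(len(dataset))       # total number of frames
print(dataset[0].keys())  # observation/action keys for a single frame
```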
|
|
|
### I/O Schema |
|
|
|
**Input Features:** |
|
- `observation.image`: type `VISUAL`, shape `[3, 256, 256]` |
|
- `observation.image2`: type `VISUAL`, shape `[3, 256, 256]` |
|
- `observation.image3`: type `VISUAL`, shape `[3, 256, 256]` |
|
- `observation.state`: type `STATE`, shape `[6]` |
|
|
|
**Output Features:** |
|
- `action`: type `ACTION`, shape `[6]` |
|
|
|
**Image Preprocessing:** |
|
Images are expected to be resized, with aspect-ratio-preserving padding, to `[512, 512]` before being passed to the model (this is the `resize_imgs_with_padding` value read from the config in the example below).
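
A minimal sketch of that preprocessing, assuming corner-aligned zero padding; verify the exact convention (corner vs. centered padding, normalization) against what `lerobot` applied at training time:

```python
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def resize_with_padding(img: Image.Image, target: int = 512) -> torch.Tensor:
    # Scale the longer side down/up to `target`, preserving aspect ratio.
    w, h = img.size
    scale = target / max(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    tensor = TF.to_tensor(img.resize((new_w, new_h)))  # [3, new_h, new_w], values in [0, 1]
    # Zero-pad the right/bottom edges up to [3, target, target].
    return TF.pad(tensor, [0, 0, target - new_w, target - new_h])

frame = resize_with_padding(Image.open("frame.png").convert("RGB"))  # any saved camera frame
print(frame.shape)  # torch.Size([3, 512, 512])
```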
|
|
|
## How to Use |
|
|
|
This model can be loaded using `transformers.AutoModel` with `trust_remote_code=True`. |
|
**You MUST have `lerobot` installed in your environment for this to work** (`pip install lerobot`).
|
|
|
```python
from transformers import AutoModel
import torch

# Replace with your model's repo_id
repo_id = "Infatoshi/smolvla"

# Load the model. CRITICAL: trust_remote_code=True, which executes the
# custom wrapper code in modeling_lerobot_policy.py.
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("Model loaded successfully!")

# Example inference: create dummy inputs matching the model's expected schema.
resize_shape = tuple(model.config.resize_imgs_with_padding)
state_shape = tuple(model.config.input_features["observation.state"]["shape"])

# Dummy observations dictionary: one image tensor per camera in the I/O
# schema above. Adjust the keys if your cameras were named differently
# at training time.
dummy_observations = {
    "state": torch.randn(1, *state_shape),
    "images": {
        "image": torch.randn(1, 3, *resize_shape),
        "image2": torch.randn(1, 3, *resize_shape),
        "image3": torch.randn(1, 3, *resize_shape),
    },
}
dummy_language_instruction = "pick up the cube"

with torch.no_grad():
    output = model(
        observations=dummy_observations,
        language_instruction=dummy_language_instruction,
    )

print("Inference output (predicted actions):", output)
print("Output shape:", output.shape)
```
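
For deployment, the same call can run inside a control loop. A hypothetical sketch follows; `get_camera_frames`, `read_joint_state`, and `send_action` are placeholder stubs standing in for your real hardware interface:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder I/O stubs; replace with your robot's camera and motor drivers.
def get_camera_frames() -> dict[str, torch.Tensor]:
    # One [3, 512, 512] tensor per camera in the I/O schema.
    return {k: torch.rand(3, 512, 512) for k in ("image", "image2", "image3")}

def read_joint_state() -> torch.Tensor:
    return torch.rand(6)  # matches observation.state shape [6]

def send_action(action: torch.Tensor) -> None:
    print("action:", action)

for _ in range(100):  # one predicted action per control tick
    observations = {
        "state": read_joint_state().unsqueeze(0).to(device),
        "images": {
            k: v.unsqueeze(0).to(device) for k, v in get_camera_frames().items()
        },
    }
    with torch.no_grad():
        action = model(
            observations=observations,
            language_instruction="pick up the cube",
        )
    send_action(action.squeeze(0).cpu())
```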