Improve model card: Add abstract, usage, overview, and metadata
This PR significantly enhances the model card for `villa-X` by:
- Adding `library_name: lam` to the metadata for better discoverability and integration.
- Updating the project page URL to `https://aka.ms/villa-x` for consistency with the official project page.
- Including the paper's abstract to provide a concise overview.
- Incorporating a detailed "Overview" section directly from the GitHub repository, explaining the model's key contributions.
- Providing comprehensive usage instructions, including installation steps and concrete Python code examples for inference, allowing users to quickly get started.
- Adding a "Pre-trained Models" table for easy access to available checkpoints.
- Including the official BibTeX citation for proper attribution.
- Adding a "Credits" section to acknowledge contributing open-source projects.
- Ensuring all image links are robust for display on the Hugging Face Hub.
These updates greatly improve the discoverability, clarity, and utility of the model card for researchers and developers.
The diff replaces the previous 21-line README.md, which contained only the YAML metadata (`license: mit`, `language: en`, `pipeline_tag: robotics`), a centered logo image, and a pointer to [https://github.com/microsoft/villa-x/](https://github.com/microsoft/villa-x/), with the 106-line version below.
---
language:
- en
license: mit
pipeline_tag: robotics
library_name: lam
---

<div align="center">
<p align="center">
<img src="https://huggingface.co/microsoft/villa-x/resolve/main/villa-x-transparent.png" width="400"/>
</p>

<h1>villa-X: A Vision-Language-Latent-Action Model</h1>

[arXiv](https://arxiv.org/abs/2507.23682) &nbsp;&nbsp; [Project Page](https://aka.ms/villa-x) &nbsp;&nbsp; [GitHub](https://github.com/microsoft/villa-x/)

</div>

This is the official Hugging Face repository for **villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models**.

## Abstract

Visual-Language-Action (VLA) models have emerged as a popular paradigm for learning robot manipulation policies that can follow language instructions and generalize to novel scenarios. Recent work has begun to explore the incorporation of latent actions, an abstract representation of visual change between two frames, into VLA pre-training. In this paper, we introduce **villa-X**, a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent action modeling for learning generalizable robot manipulation policies. Our approach improves both how latent actions are learned and how they are incorporated into VLA pre-training. Together, these contributions enable villa-X to achieve superior performance across simulated environments including SIMPLER and LIBERO, as well as on two real-world robot setups including gripper and dexterous hand manipulation. We believe the ViLLA paradigm holds significant promise, and that our villa-X provides a strong foundation for future research.

## Overview

<p align="center">
<img src="https://github.com/microsoft/villa-x/raw/main/assets/overview.png" alt="villa-x overview" width="700"/>
</p>

* We improve latent action learning by introducing an extra proprioceptive forward dynamics model (proprio FDM), which aligns latent tokens with the underlying robot states and actions and grounds them in physical dynamics.
* We propose to jointly learn a latent action expert and a robot action expert through joint diffusion in the policy model, conditioning robot action prediction on latent actions to fully exploit their potential.
* Our method demonstrates superior performance in simulated environments as well as on real-world robotic tasks. The latent action expert can effectively plan into the future with both visual and proprio state planning.

## Usage

### Setup

1. Clone the repository.

```bash
git clone https://github.com/microsoft/villa-x.git
cd villa-x
```

2. Install the required packages.

```bash
sudo apt-get install -y build-essential zlib1g-dev libffi-dev libssl-dev libbz2-dev libreadline-dev libsqlite3-dev liblzma-dev libncurses-dev tk-dev python3-dev ffmpeg
curl -LsSf https://astral.sh/uv/install.sh | sh  # Skip this step if you already have uv installed
uv sync
```

### Inference with Pre-trained Latent Action Model

1. Download the pre-trained models from [Hugging Face](https://huggingface.co/microsoft/villa-x).
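
For convenience, the checkpoint can also be fetched programmatically. The sketch below is an assumption on top of the official instructions: it uses `huggingface_hub.snapshot_download` to pull only the `lam/` sub-folder listed in the Pre-trained Models table further down, and treats that directory as the `LOCAL_MODEL_DIRECTORY` used in the next step.

```python
# Hedged example: download only the latent action model files from the Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="microsoft/villa-x",
    allow_patterns=["lam/*"],  # restrict the download to the `lam` checkpoint
)
lam_dir = f"{local_dir}/lam"   # assumption: use this as LOCAL_MODEL_DIRECTORY below
print(lam_dir)
```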

2. Load the latent action model.

```python
from lam import IgorModel

lam = IgorModel.from_pretrained("LOCAL_MODEL_DIRECTORY").cuda()
```
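
Optionally, since the model is only used for inference below, it can be switched to evaluation mode; this assumes `IgorModel` is a regular `torch.nn.Module`, as the `.cuda()` call above suggests.

```python
import torch

lam.eval()  # assumption: IgorModel subclasses torch.nn.Module

# The calls in the following steps can then be wrapped in torch.inference_mode(),
# e.g. `with torch.inference_mode(): latent_action = lam.idm(video)`.
```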

3. Extract the latent actions from a video.

```python
def read_video(fp: str):
    from torchvision.io import read_video

    video, *_ = read_video(fp, pts_unit="sec")
    return video

video = read_video("path/to/video.mp4").cuda()  # Load your video here
latent_action = lam.idm(video)
```
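
As a quick, purely illustrative sanity check (the exact structure returned by `lam.idm` is defined by the repository; the indexing below simply mirrors the reconstruction loop in the next step):

```python
# Illustrative only: inspect what came back before applying it.
print(type(latent_action))
print(len(latent_action[0]))  # assumed: one latent action per frame transition
```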

4. Use the image FDM to reconstruct future frames from the latent actions.

```python
frames = []
for i in range(len(latent_action[0])):
    pred = lam.apply_latent_action(video[i], latent_action[0][i])
    frames.append(pred)
```
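
To inspect the rollout visually, the predicted frames can be stacked and written back to disk, e.g. with `torchvision.io.write_video`. This is a sketch under the assumption that each `pred` is an `(H, W, C)` image tensor; adapt the dtype handling to whatever `lam.apply_latent_action` actually returns.

```python
import torch
from torchvision.io import write_video

rollout = torch.stack([f.detach().cpu() for f in frames])  # assumed shape: (T, H, W, C)
if rollout.dtype != torch.uint8:                           # e.g. float frames in [0, 1]
    rollout = (rollout.clamp(0, 1) * 255).to(torch.uint8)
write_video("rollout.mp4", rollout, fps=5)
```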

We also provide a Jupyter [notebook](https://github.com/microsoft/villa-x/blob/main/demo/notebook.ipynb) with a step-by-step guide to using the pre-trained latent action model.

## Pre-trained Models

| Model ID | Description | Params | Link |
|-------------------------|---------------------|--------|--------------------------------------------------------------------|
| `microsoft/villa-x/lam` | Latent action model | 955M   | 🤗 [Link](https://huggingface.co/microsoft/villa-x/tree/main/lam) |

## Citation

```bibtex
@article{chen2025villa0x0,
  title   = {villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models},
  author  = {Xiaoyu Chen and Hangxing Wei and Pushi Zhang and Chuheng Zhang and Kaixin Wang and Yanjiang Guo and Rushuai Yang and Yucen Wang and Xinquan Xiao and Li Zhao and Jianyu Chen and Jiang Bian},
  year    = {2025},
  journal = {arXiv preprint arXiv:2507.23682}
}
```

## Credits

We are grateful to open-source projects such as [Open Sora](https://github.com/hpcaitech/Open-Sora), [taming-transformers](https://github.com/CompVis/taming-transformers), [open-pi-zero](https://github.com/allenzren/open-pi-zero), [MAE](https://github.com/facebookresearch/mae), and [timm](https://github.com/rwightman/pytorch-image-models). Their contributions have been invaluable in the development of villa-X.