UniVLA - Action Decoder (Deployment Head)
About VLA-Arena
VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:
- Task Structure: 170+ tasks grouped into four key dimensions:
  - Safety: Operating reliably under strict constraints.
  - Distractor: Handling environmental unpredictability.
  - Extrapolation: Generalizing to unseen scenarios.
  - Long Horizon: Executing complex, multi-step tasks.
- Language Command: Variations in instruction complexity.
- Visual Observation: Perturbations in visual input.
Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.
Model Overview
This model is the Action Decoder Head for UniVLA. Unlike the Latent Action Model (LAM), which is used to tokenize video data, this decoder is a lightweight transformer module attached to the UniVLA backbone during deployment.
Its specific role is Detokenization: it takes the sequence of Latent Action Tokens (predicted by the VLM backbone) and Visual Embeddings, and decodes them into precise, continuous Action Chunks (7-DoF trajectories) executable by the robot.
Model Architecture
The Action Decoder is designed to bridge the gap between the discrete latent space of the VLM and the continuous action space of the robot. It utilizes Multi-Head Attention Pooling to extract context-specific features from both latent actions and visual observations.
| Component | Description |
|---|---|
| Input | Latent Action Embeddings + Visual Embeddings (VLM Last Layer) |
| Context Mechanism | Attention Pooling (Visual tokens query Action tokens) |
| Output | Action Chunks (Sequence of continuous poses) |
| Parameter Count | 12.6M (Lightweight Adapter) |
Architecture Configuration
The decoder consists of attention pooling layers followed by projection MLPs. For real-world deployment, it also includes a proprioceptive projection layer.
| Parameter | Value |
|---|---|
| Attention Heads | 8 |
| Head Dimension | 64 |
| Hidden Size | 512 |
| MLP Ratio | 4 |
| Proprioception Projection | 2 Layers (Hidden Size 512) |
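Under the configuration above, a minimal PyTorch sketch of such a head might look like the following. The class name, argument names, and the proprioception dimension are illustrative assumptions for exposition, not the released implementation:

```python
import torch
import torch.nn as nn

class ActionDecoderSketch(nn.Module):
    """Illustrative attention-pooling decoder head (not the official implementation)."""

    def __init__(self, hidden_size=512, num_heads=8, mlp_ratio=4,
                 chunk_size=12, action_dim=7, proprio_dim=8):
        super().__init__()
        # Visual embeddings act as queries; latent action tokens as keys/values.
        # 512 hidden size with 8 heads gives the 64-dim heads from the table.
        self.pool = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Two-layer projection of the proprioceptive state (real-world deployment only).
        self.proprio_proj = nn.Sequential(
            nn.Linear(proprio_dim, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        # MLP head mapping pooled features to a flat action chunk (N x 7).
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * mlp_ratio), nn.GELU(),
            nn.Linear(hidden_size * mlp_ratio, chunk_size * action_dim),
        )
        self.chunk_size, self.action_dim = chunk_size, action_dim

    def forward(self, latent_action_emb, visual_emb, proprio=None):
        # latent_action_emb: (B, T_act, 512), visual_emb: (B, T_vis, 512)
        pooled, _ = self.pool(query=visual_emb, key=latent_action_emb,
                              value=latent_action_emb)
        feat = pooled.mean(dim=1)                     # aggregate over visual queries
        if proprio is not None:
            feat = feat + self.proprio_proj(proprio)  # inject robot state features
        actions = self.head(feat)
        return actions.view(-1, self.chunk_size, self.action_dim)
```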
Key Feature: Action Chunking
Unlike OpenVLA, which predicts actions step by step, this decoder outputs Action Chunks (default size $N=12$ for real-world tasks). This allows for significantly smoother control and a higher inference frequency ($\sim$10 Hz).
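To make the effect of chunking concrete, here is a schematic control loop: one backbone + decoder call produces a full chunk that is then replayed on the robot. `policy.predict_chunk`, `robot.get_observation`, and `robot.step` are hypothetical placeholders for whatever interface the deployment stack provides:

```python
import time

CHUNK_SIZE = 12   # N = 12 actions per decoder call (real-world default)
CONTROL_HZ = 10   # approximate control rate assumed here

def run_episode(policy, robot, max_steps=600):
    """Execute chunked actions: one inference call every CHUNK_SIZE control steps."""
    step = 0
    while step < max_steps:
        obs = robot.get_observation()          # image + proprioceptive state
        chunk = policy.predict_chunk(obs)      # array of shape (CHUNK_SIZE, 7)
        for action in chunk:                   # replay the whole chunk
            robot.step(action)                 # 7-DoF end-effector pose + gripper
            time.sleep(1.0 / CONTROL_HZ)
            step += 1
```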
Training Details
Dataset
This model was fine-tuned on the VLA-Arena/VLA_Arena_L0_L_rlds dataset.
Training Strategy
This decoder is trained end-to-end with the UniVLA backbone (via LoRA): the backbone learns to predict the correct discrete latent tokens, while the decoder simultaneously learns to map those tokens to the correct continuous physical actions.
| Parameter | Value |
|---|---|
| Loss Function | L1 Loss (Ground Truth vs. Predicted Action) |
| Optimization | Joint optimization with VLM Next-Token Prediction |
| Visual Conditioning | Enabled (Visual embeddings used as queries) |
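As a rough illustration of this joint objective (the weighting term `lambda_act` and all tensor names are assumptions, not values taken from the training configuration):

```python
import torch
import torch.nn.functional as F

def joint_loss(token_logits, token_targets, pred_actions, gt_actions, lambda_act=1.0):
    """Next-token prediction loss for the VLM backbone plus L1 loss for the decoder."""
    # Cross-entropy over the discrete latent action tokens (backbone objective).
    ntp_loss = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten())
    # L1 regression between predicted and ground-truth continuous action chunks.
    act_loss = F.l1_loss(pred_actions, gt_actions)
    return ntp_loss + lambda_act * act_loss
```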
Evaluation & Usage
This model must be used in conjunction with the UniVLA backbone.
- Backbone Phase: The VLM predicts a sequence of discrete latent tokens (e.g., `<ACT_1>`, `<ACT_2>`).
- Decoder Phase (This Model): These tokens, along with the visual context, are passed to this Action Decoder to generate the final $7 \times N$ action vector (end-effector pose + gripper).
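Putting the two phases together, a hypothetical end-to-end call could look like the sketch below; `generate_latent_tokens`, `embed_tokens`, and the `decoder` signature are illustrative names, not the released API:

```python
import torch

@torch.no_grad()
def predict_actions(backbone, decoder, image, instruction, proprio=None):
    """Phase 1: the VLM predicts latent action tokens. Phase 2: the decoder detokenizes them."""
    # Backbone phase: autoregressively generate discrete latent action tokens and
    # keep the last-layer visual embeddings for conditioning the decoder.
    latent_tokens, visual_emb = backbone.generate_latent_tokens(image, instruction)
    latent_emb = backbone.embed_tokens(latent_tokens)   # look up token embeddings
    # Decoder phase: attention-pool and regress the continuous 7 x N action chunk.
    actions = decoder(latent_emb, visual_emb, proprio)  # shape (1, N, 7)
    return actions.squeeze(0)
```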
Ablation studies show that this Visual-Attention Decoder outperforms standard auto-regressive decoding by 42.1% on long-horizon tasks (LIBERO-Long), demonstrating its efficacy in reducing ambiguity and improving precision.
For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.