UniVLA - Action Decoder (Deployment Head)

About VLA-Arena

VLA-Arena is a comprehensive benchmark designed to quantitatively understand the limits and failure modes of Vision-Language-Action (VLA) models. While VLAs are advancing towards generalist robot policies, measuring their true capability frontiers remains challenging. VLA-Arena addresses this by proposing a novel structured task design framework that quantifies difficulty across three orthogonal axes:

  1. Task Structure: 170+ tasks grouped into four key dimensions:
    • Safety: Operating reliably under strict constraints.
    • Distractor: Handling environmental unpredictability.
    • Extrapolation: Generalizing to unseen scenarios.
    • Long Horizon: Executing complex, multi-step tasks.
  2. Language Command: Variations in instruction complexity.
  3. Visual Observation: Perturbations in visual input.

Tasks are designed with hierarchical difficulty levels (L0-L2). In this benchmark setting, fine-tuning is typically performed on L0 tasks to assess the model's ability to generalize to higher difficulty levels and strictly follow safety constraints.

Model Overview

This model is the Action Decoder Head for UniVLA. Unlike the Latent Action Model (LAM), which is used to tokenize video data, this decoder is a lightweight transformer module attached to the UniVLA backbone during deployment.

Its specific role is Detokenization: it takes the sequence of Latent Action Tokens predicted by the VLM backbone, together with the Visual Embeddings, and decodes them into precise, continuous Action Chunks (7-DoF trajectories) executable by the robot.


Model Architecture

The Action Decoder is designed to bridge the gap between the discrete latent space of the VLM and the continuous action space of the robot. It utilizes Multi-Head Attention Pooling to extract context-specific features from both latent actions and visual observations.

| Component | Description |
|---|---|
| Input | Latent Action Embeddings + Visual Embeddings (VLM last layer) |
| Context Mechanism | Attention Pooling (visual tokens query action tokens) |
| Output | Action Chunks (sequence of continuous poses) |
| Parameter Count | 12.6M (lightweight adapter) |
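
To make the pooling mechanism concrete, here is a minimal PyTorch sketch in which the visual embeddings act as queries over the latent action embeddings. The module and argument names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AttentionPoolingSketch(nn.Module):
    """Minimal sketch (not the official code): visual tokens query the
    latent action tokens to pool context-specific features."""

    def __init__(self, hidden_size: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, visual_emb: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        # visual_emb: (B, num_visual_tokens, hidden_size) -- used as queries
        # action_emb: (B, num_action_tokens, hidden_size) -- used as keys/values
        pooled, _ = self.attn(query=visual_emb, key=action_emb, value=action_emb)
        return self.norm(visual_emb + pooled)  # residual connection + layer norm
```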

Architecture Configuration

The decoder consists of attention pooling layers followed by projection MLPs. For real-world deployment, it also includes a proprioceptive projection layer.

| Parameter | Value |
|---|---|
| Attention Heads | 8 |
| Head Dimension | 64 |
| Hidden Size | 512 |
| MLP Ratio | 4 |
| Proprioception Projection | 2 layers (hidden size 512) |
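
For reference, the table above can be captured in a small configuration object. This is a hedged sketch: the field names, and the assumption that the hidden size equals heads × head dimension, are ours rather than taken from the release.

```python
from dataclasses import dataclass

@dataclass
class ActionDecoderConfig:
    # Values from the table above; field names are illustrative.
    num_heads: int = 8
    head_dim: int = 64
    hidden_size: int = 512      # assumed to equal num_heads * head_dim
    mlp_ratio: int = 4          # MLP inner dim = mlp_ratio * hidden_size = 2048
    proprio_layers: int = 2     # proprioceptive projection (real-world deployment only)
    proprio_hidden: int = 512
    chunk_size: int = 12        # N, see "Action Chunking" below
    action_dim: int = 7         # end-effector pose + gripper

cfg = ActionDecoderConfig()
assert cfg.hidden_size == cfg.num_heads * cfg.head_dim
```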

Key Feature: Action Chunking

Unlike OpenVLA, which predicts actions step by step, this decoder outputs Action Chunks (default size $N=12$ for real-world tasks). This allows for significantly smoother, higher-frequency control ($\sim$10 Hz).
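
Below is a sketch of how chunked execution looks in a deployment loop. The `policy`, `get_observation`, and `send_action` callables are hypothetical placeholders for the backbone + decoder stack and the robot interface.

```python
import time

def run_chunked_control(policy, get_observation, send_action,
                        chunk_size: int = 12, control_hz: float = 10.0):
    """Illustrative loop: one (slower) VLM + decoder inference yields a chunk
    of N actions, which are then streamed to the robot at the control rate."""
    while True:
        obs = get_observation()
        action_chunk = policy(obs)        # assumed shape: (chunk_size, 7)
        for action in action_chunk:
            send_action(action)           # single 7-DoF command
            time.sleep(1.0 / control_hz)  # ~10 Hz execution
```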


Training Details

Dataset

This model was fine-tuned on the VLA-Arena/VLA_Arena_L0_L_rlds dataset.

Training Strategy

This decoder is trained end-to-end with the UniVLA backbone (via LoRA). While the backbone learns to predict the correct discrete latent token, this decoder simultaneously learns to map that token to the correct continuous physical action.

| Parameter | Value |
|---|---|
| Loss Function | L1 Loss (ground truth vs. predicted action) |
| Optimization | Joint optimization with VLM next-token prediction |
| Visual Conditioning | Enabled (visual embeddings used as queries) |
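
Put together, the joint objective can be sketched as a single training step. The equal weighting of the two terms is an assumption made for illustration.

```python
import torch.nn.functional as F

def joint_loss(vlm_logits, latent_token_targets, pred_actions, gt_actions,
               action_weight: float = 1.0):
    """Sketch of the joint objective: next-token cross-entropy on the backbone's
    discrete latent action tokens plus L1 loss on the decoder's continuous actions."""
    # vlm_logits:           (B, T, vocab_size)  backbone next-token predictions
    # latent_token_targets: (B, T)              ground-truth latent token ids
    # pred_actions:         (B, N, 7)           decoder output action chunk
    # gt_actions:           (B, N, 7)           ground-truth action chunk
    token_loss = F.cross_entropy(vlm_logits.flatten(0, 1), latent_token_targets.flatten())
    action_loss = F.l1_loss(pred_actions, gt_actions)
    return token_loss + action_weight * action_loss
```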

Evaluation & Usage

This model must be used in conjunction with the UniVLA backbone.

  1. Backbone Phase: The VLM predicts a sequence of discrete latent tokens (e.g., <ACT_1>, <ACT_2>).
  2. Decoder Phase (This Model): These tokens, along with the visual context, are passed to this Action Decoder to generate the final $7\times N$ action vector (End-effector pose + Gripper); see the sketch below.
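
A hedged sketch of this two-phase inference flow follows; `generate_latent_actions` and the decoder call signature are illustrative assumptions about the interface, not the published API.

```python
import torch

@torch.no_grad()
def predict_action_chunk(backbone, action_decoder, image, instruction, proprio=None):
    # 1. Backbone phase: the VLM consumes image + instruction and autoregressively
    #    emits discrete latent action tokens (e.g. <ACT_1>, <ACT_2>), along with
    #    its last-layer visual embeddings.
    latent_tokens, visual_emb = backbone.generate_latent_actions(image, instruction)

    # 2. Decoder phase (this model): the latent tokens are decoded, conditioned on
    #    the visual embeddings (and proprioception, if available), into a continuous
    #    action chunk of shape (N, 7).
    return action_decoder(latent_tokens, visual_emb, proprio=proprio)
```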

Ablation studies show that this Visual-Attention Decoder outperforms standard auto-regressive decoding by 42.1% on long-horizon tasks (LIBERO-Long), indicating that it reduces ambiguity and improves precision.

For detailed evaluation instructions, metrics, and scripts, please refer to the VLA-Arena repository.
