FlowerVLA - Vision-Language-Action Flow Model for CALVIN ABC

This is a pretrained FlowerVLA model for robotic manipulation, trained on the CALVIN ABC dataset. Flower is an efficient Vision-Language-Action flow policy for robot learning that contains only ~1B parameters.

Model Description

FlowerVLA is a novel architecture that:

  • Uses half of Florence-2 for multi-modal vision-language encoding
  • Employs a novel transformer-based flow matching architecture (see the sketch after this list)
  • Provides an efficient, versatile VLA policy with only ~1B parameters
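To make the flow matching component concrete, here is a minimal, hedged sketch of how such a policy typically generates actions at inference time: starting from Gaussian noise, a learned velocity field is integrated toward the action distribution. The function names, signatures, and Euler integration scheme are illustrative assumptions, not FlowerVLA's actual API.

import torch

def sample_actions(velocity_net, cond, action_dim=7, horizon=10, steps=10):
    # Minimal flow-matching sampler (illustrative; `velocity_net` and its
    # signature are assumptions, not the FlowerVLA implementation).
    # velocity_net(x, t, cond) is assumed to predict the velocity v_t that
    # transports noise (t = 0) toward the action distribution (t = 1).
    x = torch.randn(1, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v = velocity_net(x, t, cond)  # predicted velocity at time t
        x = x + dt * v                # explicit Euler integration step
    return x  # action chunk of shape (1, horizon, action_dim)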

Model Performance

This checkpoint contains the weights used for the SIMPLER results reported in the paper. See the pretraining codebase for instructions on testing it.

Input/Output Specifications

Inputs

  • RGB Static Camera: (B, T, 3, H, W) tensor
  • RGB Gripper Camera: (B, T, 3, H, W) tensor
  • Language Instructions: Text strings

Outputs

  • Action Space: (B, T, 7) or (B, T, 8) tensor representing delta end-effector (EEF) actions or joint-state actions
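As a quick shape check, the following sketch constructs dummy inputs matching the specification above; the image resolution and dtype are assumptions, so consult the repository's preprocessing code for the exact values.

import torch

B, T, H, W = 1, 1, 224, 224  # batch, time, image size (224x224 is an assumption)
static_image = torch.zeros(B, T, 3, H, W)   # RGB static camera observation
gripper_image = torch.zeros(B, T, 3, H, W)  # RGB gripper camera observation
instruction = "pick up the blue cube"       # free-form language goal
# Given these inputs, the policy returns a (B, T, 7) or (B, T, 8) action tensor.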

Usage

Check out our full model implementation on GitHub (link TODO) and follow the instructions in the README to test the model in one of the environments.

# Single inference step (model loading follows the repository README).
# static_image / gripper_image: (B, T, 3, H, W) RGB tensors as specified above.
obs = {
    "rgb_obs": {
        "rgb_static": static_image,
        "rgb_gripper": gripper_image,
    }
}
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)  # (B, T, 7) or (B, T, 8) action tensor
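Building on the single step above, a closed-loop rollout might look like the following sketch. The environment API (`env.reset`, `env.step`) and the rollout horizon are generic CALVIN-style assumptions, not part of this model's interface.

# Hedged closed-loop rollout sketch; `env` is a hypothetical CALVIN-style
# environment returning observations in the format expected by `model.step`.
obs = env.reset()
goal = {"lang_text": "pick up the blue cube"}
for _ in range(360):  # rollout horizon is an assumption
    action = model.step(obs, goal)
    obs, _, done, _ = env.step(action)
    if done:
        break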

Citation

@inproceedings{reuss2025flower,
  % full citation to be added when available
}

License

This model is released under the MIT license.
