# FlowerVLA - Vision-Language-Action Flow Model for CALVIN ABC
This is a pretrained FlowerVLA model for robotic manipulation, trained on the CALVIN ABC dataset. Flower is an efficient Vision-Language-Action (VLA) flow policy for robot learning that contains only ~1B parameters.
## Model Description
FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow-matching architecture (see the sketch after this list)
- Provides an efficient, versatile VLA policy with only ~1B parameters
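
For intuition about the flow-matching part, here is a minimal, self-contained sketch of how a flow policy turns noise into an action chunk at inference time: a learned velocity field is integrated from noise (t=0) to actions (t=1) with a few Euler steps. The `velocity_field` stand-in, the shapes, and the step count are illustrative assumptions, not FlowerVLA's actual implementation.

```python
import torch

def sample_actions(velocity_field, shape=(1, 8, 7), num_steps=10):
    """Integrate a learned velocity field v(x_t, t) from noise to actions.

    `velocity_field` is a stand-in for the conditioned transformer; in the
    real model it would also receive the encoded vision-language context.
    """
    x = torch.randn(shape)  # start from Gaussian noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)   # current flow time per batch element
        x = x + dt * velocity_field(x, t)     # Euler step along the flow
    return x  # final sample = predicted action chunk

# Example with a dummy velocity field that ignores its inputs:
actions = sample_actions(lambda x, t: torch.zeros_like(x))
```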
## Model Performance
This checkpoint contains the weights for the CALVIN ABC results reported in the paper. See the pretraining codebase for evaluation instructions.
## Input/Output Specifications
### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings
### Outputs
- Action Space: `(B, T, 7)` or `(B, T, 8)` tensor representing delta end-effector (EEF) actions or joint-state actions
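
As a concrete illustration of the 7-dimensional case, the sketch below splits an action chunk into translation, rotation, and gripper components. The xyz/rpy/gripper ordering is a common convention for delta-EEF control and is an assumption here, not something this card documents.

```python
import torch

B, T = 1, 8  # assumed batch size and action-chunk length
action = torch.zeros(B, T, 7)  # placeholder (B, T, 7) delta-EEF action chunk

# Assumed layout: 3 position deltas, 3 orientation deltas, 1 gripper command.
delta_pos = action[..., 0:3]  # (B, T, 3) translation deltas
delta_rot = action[..., 3:6]  # (B, T, 3) rotation deltas (e.g. Euler angles)
gripper = action[..., 6:7]    # (B, T, 1) gripper open/close command
```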
## Usage
Check out our full model implementation on GitHub (link TODO) and follow the instructions in the README to test the model in one of the supported environments.
```python
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # (B, T, 3, H, W) static camera tensor
        "rgb_gripper": gripper_image,  # (B, T, 3, H, W) gripper camera tensor
    }
}
goal = {"lang_text": "pick up the blue cube"}  # free-form language instruction
action = model.step(obs, goal)                 # predicted action chunk
```
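
For closed-loop evaluation, the same `step` call can be wrapped in an environment loop. The sketch below is hypothetical: `env` and `model` are placeholders for a Gym-style environment (whose observations already arrive in the dictionary format above) and a loaded policy; only `model.step(obs, goal)` is the documented inference call.

```python
def rollout(env, model, instruction, max_steps=360):
    """Run one language-conditioned episode and report success."""
    obs = env.reset()
    goal = {"lang_text": instruction}
    for _ in range(max_steps):
        action = model.step(obs, goal)              # documented inference call
        obs, reward, done, info = env.step(action)  # assumed Gym-style API
        if done:
            return info.get("success", False)
    return False
```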
## Citation

```bibtex
@inproceedings{
  reuss2025flower,
  % Full citation to be added when available
}
```
## License
This model is released under the MIT license.