# Model Card for RoboFAC-7B
RoboFAC-7B is a large-scale vision-language model finetuned specifically for robotic failure understanding and correction. Given visual observations of a robot execution (typically video frames), it produces detailed answers to questions that analyze the task, diagnose where and why a manipulation failure occurred, and propose corrections.
## Model Details

### Model Description
- Developed by: MINT Lab, Shanghai Jiao Tong University
- Model type: Vision-Language Model (VLM) for robotic failure analysis
- Languages: English (instruction-tuned for robotic QA)
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct
## Uses

### Direct Use
The model is intended to serve as an external critic in robotic systems, used to:
- Perform task understanding by answering what the robot is doing.
- Conduct failure diagnosis by identifying where and why the execution failed.
- Generate correction suggestions based on visual observations (see the sketch after this list).
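As a concrete illustration of these three roles, here is a minimal sketch of querying the model once per category. The `ask` helper and the example questions are illustrative choices, not part of the model's API; `model`, `processor`, and `frames` are assumed to be set up as in the Quickstart section below.

```python
# Minimal sketch: assumes `model` and `processor` are loaded as in the Quickstart
# below, and `frames` is a list of PIL images sampled from the execution video.
def ask(frames, question):  # hypothetical helper, not part of the model's API
    content = [{"type": "image"} for _ in frames] + [{"type": "text", "text": question}]
    prompt = processor.apply_chat_template([{"role": "user", "content": content}],
                                           add_generation_prompt=True)
    inputs = processor(images=frames, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

task = ask(frames, "What task is the robot performing?")                  # task understanding
diagnosis = ask(frames, "At which step did the execution fail, and why?")  # failure diagnosis
fix = ask(frames, "How should the robot adjust its actions to succeed?")   # correction suggestion
```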
### Downstream Use

The model can be integrated into:
- Vision-language control pipelines (e.g., vision-language-action (VLA) systems)
- Robotic operation monitoring tools (a loop sketch follows this list)
- Training pipelines for agents with self-improvement capabilities
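For the monitoring use case, one possible integration pattern is a periodic critic check during a rollout. The sketch below is purely illustrative: `policy_step`, `capture_frame`, and the keyword-based failure trigger are hypothetical placeholders for components of a real control stack, and `ask` is the helper sketched under Direct Use.

```python
# Hypothetical monitoring-loop sketch; every callable passed in is a stand-in
# for a component of a real pipeline, not part of RoboFAC's API.
def monitored_rollout(policy_step, capture_frame, ask, horizon=100, check_every=10):
    frames = []
    for t in range(horizon):
        policy_step()                   # advance the controller one step
        frames.append(capture_frame())  # record the current observation
        if (t + 1) % check_every == 0:
            verdict = ask(frames, "Is the task on track? If not, where and why is it failing?")
            if "fail" in verdict.lower():  # crude trigger; a real system would parse structured output
                correction = ask(frames, "How should the robot adjust its actions?")
                return {"failed_at": t, "verdict": verdict, "correction": correction}
    return {"failed_at": None}
```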
## Quickstart
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("MINT-SJTU/RoboFAC-7B", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage with image frames (PIL images) and a question
frames = [...]  # your sampled frames of the robot execution video
messages = [{"role": "user",
             "content": [{"type": "image"} for _ in frames] + [{"type": "text", "text": "Why did the robot fail?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=frames, text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
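The processor consumes decoded frames rather than a raw video file. One way to obtain them is uniform sampling with OpenCV, as sketched below; the choice of 8 frames is an arbitrary assumption, not a model requirement.

```python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:  # OpenCV decodes BGR; convert to RGB before wrapping in PIL
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```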
## Citation
BibTeX:
```bibtex
@misc{lu2025robofaccomprehensiveframeworkrobotic,
  title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
  author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
  year={2025},
  eprint={2505.12224},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.12224}
}
```