# Model Card for RoboFAC-7B
RoboFAC-7B is a large-scale vision-language model finetuned specifically for robotic failure understanding and correction. Given visual observations of a robot execution (typically video frames), it produces detailed answers to questions that analyze the task, diagnose where and why a manipulation failure occurred, and propose corrections.
## Model Details

### Model Description
- Developed by: MINT Lab, Shanghai Jiao Tong University
- Model type: Vision-Language Model (VLM) for robotic failure analysis
- Languages: English (instruction-tuned for robotic QA)
- License: Apache 2.0
- Finetuned from model: Qwen/Qwen2.5-VL-7B-Instruct
## Uses

### Direct Use
The model is intended to serve as an external critic in robotic systems, used to:
- Perform task understanding by answering what the robot is doing.
- Conduct failure diagnosis by identifying where and why the execution failed.
- Generate correction suggestions based on visual observations (see the sketch after this list).
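As a concrete illustration of these three roles, here is a minimal sketch of querying the model once per category. The `ask` helper and the example questions are illustrative choices, not part of the model's API; `model`, `processor`, and `frames` are assumed to be set up as in the Quickstart section below.

```python
# Minimal sketch: assumes `model` and `processor` are loaded as in the Quickstart
# below, and `frames` is a list of PIL images sampled from the execution video.
def ask(frames, question):  # hypothetical helper, not part of the model's API
    content = [{"type": "image"} for _ in frames] + [{"type": "text", "text": question}]
    prompt = processor.apply_chat_template([{"role": "user", "content": content}],
                                           add_generation_prompt=True)
    inputs = processor(images=frames, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

task = ask(frames, "What task is the robot performing?")                  # task understanding
diagnosis = ask(frames, "At which step did the execution fail, and why?")  # failure diagnosis
fix = ask(frames, "How should the robot adjust its actions to succeed?")   # correction suggestion
```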
### Downstream Use

The model can be integrated into:
- Vision-language control pipelines (e.g., vision-language-action (VLA) systems)
- Robotic operation monitoring tools (a loop sketch follows this list)
- Training pipelines for agents with self-improvement capabilities
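For the monitoring use case, one possible integration pattern is a periodic critic check during a rollout. The sketch below is purely illustrative: `policy_step`, `capture_frame`, and the keyword-based failure trigger are hypothetical placeholders for components of a real control stack, and `ask` is the helper sketched under Direct Use.

```python
# Hypothetical monitoring-loop sketch; every callable passed in is a stand-in
# for a component of a real pipeline, not part of RoboFAC's API.
def monitored_rollout(policy_step, capture_frame, ask, horizon=100, check_every=10):
    frames = []
    for t in range(horizon):
        policy_step()                   # advance the controller one step
        frames.append(capture_frame())  # record the current observation
        if (t + 1) % check_every == 0:
            verdict = ask(frames, "Is the task on track? If not, where and why is it failing?")
            if "fail" in verdict.lower():  # crude trigger; a real system would parse structured output
                correction = ask(frames, "How should the robot adjust its actions?")
                return {"failed_at": t, "verdict": verdict, "correction": correction}
    return {"failed_at": None}
```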
## Quickstart
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("MINT-SJTU/RoboFAC-7B", torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage with image frames (PIL images) and a question
frames = [...]  # your sampled frames of the robot execution video
messages = [{"role": "user",
             "content": [{"type": "image"} for _ in frames] + [{"type": "text", "text": "Why did the robot fail?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=frames, text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
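The processor consumes decoded frames rather than a raw video file. One way to obtain them is uniform sampling with OpenCV, as sketched below; the choice of 8 frames is an arbitrary assumption, not a model requirement.

```python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * total / num_frames))
        ok, frame = cap.read()
        if ok:  # OpenCV decodes BGR; convert to RGB before wrapping in PIL
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames
```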
## Citation
BibTeX:
```bibtex
@misc{lu2025robofaccomprehensiveframeworkrobotic,
  title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
  author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
  year={2025},
  eprint={2505.12224},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.12224}
}
```