--- datasets: - MINT-SJTU/RoboFAC-dataset base_model: - Qwen/Qwen2.5-VL-7B-Instruct --- # Model Card for RoboFAC-7B [![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://mint-sjtu.github.io/RoboFAC.io/) [![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2505.12224) [![Dataset](https://img.shields.io/badge/Dataset-Huggingface-green)](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) [![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/MINT-SJTU/RoboFAC-7B) RoboFAC-7B is a large-scale vision-language model specifically finetuned for **robotic failure understanding and correction**. It takes in visual observations of robot executions (usually video frames) and outputs detailed answers to questions that analyze, diagnose, and propose corrections for robotic manipulation failures. ## Model Details ### Model Description * **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/) * **Model type:** Vision-Language Model (VLM) for robotic failure analysis * **Languages:** English (instruction-tuned for robotic QA) * **License:** Apache 2.0 * **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct --- ## Uses ### Direct Use The model is intended to be used in robotic systems as an *external critic*, to: * Perform **task understanding** by answering what the robot is doing. * Conduct **failure diagnosis** by identifying where and why it failed. * Generate **correction suggestions** based on visual observations. ### Downstream Use The model can be integrated into: * Vision-language control pipelines (e.g., VLA systems) * Robotic operation monitoring tools * Training agents with self-improvement capabilities --- ## Quickstart ```python from transformers import AutoProcessor, AutoModelForVision2Seq model = AutoModelForVision2Seq.from_pretrained("MINT-SJTU/RoboFAC-7B") processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B") # Example usage with image frames and a question inputs = processor(images=[...], text="Why did the robot fail?", return_tensors="pt").to("cuda") outputs = model.generate(**inputs) print(processor.batch_decode(outputs, skip_special_tokens=True)) ``` ## Citation **BibTeX:** ```bibtex @misc{lu2025robofaccomprehensiveframeworkrobotic, title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction}, author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao}, year={2025}, eprint={2505.12224}, archivePrefix={arXiv}, primaryClass={cs.RO}, url={https://arxiv.org/abs/2505.12224} } ```