---
datasets:
- MINT-SJTU/RoboFAC-dataset
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
# Model Card for RoboFAC-7B
[Project Page](https://mint-sjtu.github.io/RoboFAC.io/) | [Paper (arXiv:2505.12224)](https://arxiv.org/abs/2505.12224) | [Dataset](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) | [Model](https://huggingface.co/MINT-SJTU/RoboFAC-7B)
RoboFAC-7B is a large-scale vision-language model specifically finetuned for **robotic failure understanding and correction**. It takes in visual observations of robot executions (usually video frames) and outputs detailed answers to questions that analyze, diagnose, and propose corrections for robotic manipulation failures.
## Model Details
### Model Description
* **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/)
* **Model type:** Vision-Language Model (VLM) for robotic failure analysis
* **Languages:** English (instruction-tuned for robotic QA)
* **License:** Apache 2.0
* **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct
---
## Uses
### Direct Use
The model is intended to be used in robotic systems as an *external critic* (illustrative prompts follow this list), to:
* Perform **task understanding** by answering what the robot is doing.
* Conduct **failure diagnosis** by identifying where and why it failed.
* Generate **correction suggestions** based on visual observations.
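
As a concrete illustration of these three roles, the sketch below gives one example question per category. The wording is hypothetical and chosen for illustration; it is not quoted from the RoboFAC dataset's QA taxonomy.

```python
# Illustrative prompts for the three question categories
# (hypothetical phrasing, not verbatim from the RoboFAC dataset).
EXAMPLE_PROMPTS = {
    "task_understanding": "What task is the robot attempting in these frames?",
    "failure_diagnosis": "At which step did the execution fail, and why?",
    "correction": "How should the robot adjust its motion to complete the task?",
}
```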
### Downstream Use
The model can be integrated into:
* Vision-language control pipelines (e.g., VLA systems)
* Robotic operation monitoring tools
* Training agents with self-improvement capabilities
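
A minimal sketch of what such an integration might look like as a monitoring loop. Everything here except the two critic questions is a hypothetical placeholder for your own control stack: `run_policy_step` and `ask_robofac` are stubs you would replace, with `ask_robofac` wrapping the inference call shown in the Quickstart below.

```python
def run_policy_step(env, policy):
    """Hypothetical stub: advance the policy one step, return (frame, done)."""
    raise NotImplementedError

def ask_robofac(frames, question):
    """Hypothetical stub: wrap the RoboFAC-7B inference call from the Quickstart."""
    raise NotImplementedError

def monitor_episode(env, policy, max_steps=100):
    # Roll out the policy, keeping the visual observation from each step.
    frames = []
    for _ in range(max_steps):
        frame, done = run_policy_step(env, policy)
        frames.append(frame)
        if done:
            break
    # Ask the external critic to diagnose the rollout and propose a fix.
    diagnosis = ask_robofac(frames, "Where and why did the robot fail?")
    correction = ask_robofac(frames, "How should the robot correct this failure?")
    return diagnosis, correction
```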
---
## Quickstart
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the model onto the GPU; bfloat16 keeps memory manageable for a 7B model.
model = AutoModelForVision2Seq.from_pretrained(
    "MINT-SJTU/RoboFAC-7B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage with image frames and a question
inputs = processor(
    images=[...],  # a list of PIL.Image frames from the robot execution
    text="Why did the robot fail?",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
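
For multi-frame video input, Qwen2.5-VL-style processors typically expect the question to be wrapped in a chat template. A minimal sketch, reusing `model` and `processor` from above and assuming the frames have already been extracted to disk (the file paths and the stacking scenario are hypothetical):

```python
from PIL import Image

# Hypothetical paths: frames sampled from a recorded robot execution.
frames = [Image.open(f"rollout/frame_{i:03d}.jpg") for i in range(8)]

messages = [
    {
        "role": "user",
        "content": [
            *[{"type": "image"} for _ in frames],  # one image placeholder per frame
            {"type": "text", "text": "The robot tried to stack the cubes. "
                                     "Where did it fail, and how can it recover?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=frames, text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```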
## Citation
**BibTeX:**
```bibtex
@misc{lu2025robofaccomprehensiveframeworkrobotic,
title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
year={2025},
eprint={2505.12224},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2505.12224}
}
```