---
datasets:
- MINT-SJTU/RoboFAC-dataset
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# Model Card for RoboFAC-7B
[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://mint-sjtu.github.io/RoboFAC.io/) [![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2505.12224) [![Dataset](https://img.shields.io/badge/Dataset-Huggingface-green)](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) [![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/MINT-SJTU/RoboFAC-7B)
RoboFAC-7B is a large-scale vision-language model finetuned specifically for **robotic failure understanding and correction**. It takes visual observations of robot executions (typically video frames) and produces detailed answers to questions about the execution: what task the robot is performing, where and why a manipulation failure occurred, and how to correct it.

## Model Details

### Model Description

* **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/)
* **Model type:** Vision-Language Model (VLM) for robotic failure analysis
* **Languages:** English (instruction-tuned for robotic QA)
* **License:** Apache 2.0
* **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct


---

## Uses

### Direct Use

The model is intended to be used in robotic systems as an *external critic*, to:

* Perform **task understanding** by answering what the robot is doing.
* Conduct **failure diagnosis** by identifying where and why it failed.
* Generate **correction suggestions** based on visual observations (example prompts below).
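
Each capability corresponds naturally to a question prompt. The phrasings below are illustrative examples only, not a fixed schema required by the model:

```python
# Illustrative prompts for the three capabilities; the exact phrasings
# are examples, not a schema mandated by RoboFAC-7B.
PROMPTS = {
    "task_understanding": "What task is the robot performing in these frames?",
    "failure_diagnosis": "At which step did the execution fail, and why?",
    "correction": "How should the robot adjust its actions to recover and complete the task?",
}
```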

### Downstream Use

The model can be integrated into:

* Vision-language control pipelines (e.g., VLA systems)
* Robotic operation monitoring tools
* Training agents with self-improvement capabilities (see the integration sketch below)
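
As a rough sketch of the external-critic pattern, the loop below runs a policy, collects frames, and queries the model after the episode. Here `policy`, `camera`, and `query_robofac` are hypothetical stand-ins for your own stack; `query_robofac` would wrap the Quickstart call shown in the next section:

```python
# Hypothetical external-critic loop: `policy`, `camera`, and `query_robofac`
# are placeholders for your own stack, not APIs shipped with RoboFAC-7B.
def monitor_episode(policy, camera, query_robofac, max_steps=100):
    frames = []
    for _ in range(max_steps):
        frames.append(camera.capture())   # record an observation
        action = policy.step(frames[-1])  # low-level controller acts
        if action.done:
            break
    # Query the critic after the episode (or at intermediate checkpoints).
    diagnosis = query_robofac(frames, "Did the task succeed? If not, where and why did it fail?")
    correction = query_robofac(frames, "How should the robot correct its actions?")
    return diagnosis, correction
```
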
---

## Quickstart

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "MINT-SJTU/RoboFAC-7B", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage: replace [...] with a list containing one PIL.Image frame
# (add one {"type": "image"} entry per extra frame). The chat template
# inserts the vision placeholder tokens the processor expects.
messages = [{"role": "user", "content": [{"type": "image"},
            {"type": "text", "text": "Why did the robot fail?"}]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[...], text=prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```
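
RoboFAC-7B inherits the Qwen2.5-VL interface, so multi-frame clips can be passed by adding one `{"type": "image"}` entry per frame (with a matching list of images), or as a video input following the upstream Qwen2.5-VL usage.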


## Citation

**BibTeX:**

```bibtex
@misc{lu2025robofaccomprehensiveframeworkrobotic,
  title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
  author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
  year={2025},
  eprint={2505.12224},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.12224}
}
```