<!-- ```yaml
datasets:
- MINT-SJTU/RoboFAC-dataset
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
``` -->

# Model Card for RoboFAC-7B

[![Project Page](https://img.shields.io/badge/Project-Page-blue)]() [![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2505.12224) [![Dataset](https://img.shields.io/badge/Dataset-Huggingface-green)](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) [![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/MINT-SJTU/RoboFAC-7B)

RoboFAC-7B is a large-scale vision-language model finetuned for **robotic failure understanding and correction**. Given visual observations of a robot execution (typically sampled video frames) and a question, it produces detailed answers that analyze the task, diagnose where and why a manipulation failure occurred, and propose corrections.

## Model Details

### Model Description

* **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/)
* **Model type:** Vision-Language Model (VLM) for robotic failure analysis
* **Languages:** English (instruction-tuned for robotic QA)
* **License:** Apache 2.0
* **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct

---

## Uses

### Direct Use

The model is intended to serve as an *external critic* in robotic systems (illustrative prompts follow this list), to:

* Perform **task understanding** by answering what the robot is doing.
* Conduct **failure diagnosis** by identifying where and why it failed.
* Generate **correction suggestions** based on visual observations.
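
A minimal sketch of how these three question types could be phrased as prompts; the wording below is illustrative, not taken from the RoboFAC dataset:

```python
# Hypothetical prompt phrasings for the three use cases above.
QUESTIONS = {
    "task understanding": "What task is the robot performing in this clip?",
    "failure diagnosis": "At which step did the execution fail, and why?",
    "correction suggestion": "How should the robot adjust its actions to complete the task?",
}
```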

### Downstream Use

The model can be integrated into (a monitoring sketch follows this list):

* Vision-language control pipelines (e.g., VLA systems)
* Robotic operation monitoring tools
* Training agents with self-improvement capabilities
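
A hedged sketch of the monitoring use: an episode is rolled out, frames are collected, and RoboFAC-7B is queried as an external critic afterwards. The callables `run_step` and `ask_critic` are placeholders for your robot stack and for a wrapper around the Quickstart call below; they are not part of the RoboFAC release.

```python
from typing import Callable, List, Tuple

def monitor_episode(
    run_step: Callable[[], Tuple[object, bool]],     # advances the robot one step -> (frame, done)
    ask_critic: Callable[[List[object], str], str],  # wraps a RoboFAC-7B generate() call
    horizon: int = 100,
) -> str:
    """Collect rollout frames, then ask the external critic for a diagnosis."""
    frames = []
    for _ in range(horizon):
        frame, done = run_step()
        frames.append(frame)
        if done:
            break
    return ask_critic(frames, "Did the robot complete the task? If not, where and why did it fail?")
```

---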

## Quickstart

A minimal sketch with `transformers`. RoboFAC-7B follows the Qwen2.5-VL architecture, so the standard Qwen2.5-VL loading and chat-template flow is assumed; the frame path is a placeholder:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "MINT-SJTU/RoboFAC-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage with an execution frame and a question.
frame = Image.open("frame_000.png")  # placeholder: a frame sampled from the rollout
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Why did the robot fail?"},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[frame], text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
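
Since failures usually unfold over time, you may want to pass several sampled frames as a single clip. A hedged variant, assuming the processor's `videos=` input and the chat template's `{"type": "video"}` placeholder; the frame paths are again placeholders:

```python
# Reuses `model` and `processor` from the snippet above.
clip = [Image.open(f"frame_{i:03d}.png") for i in range(8)]  # evenly sampled rollout frames
messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Where did the execution go wrong, and how can it be corrected?"},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(videos=[clip], text=text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```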

## Citation

**BibTeX:**

```bibtex
@misc{lu2025robofaccomprehensiveframeworkrobotic,
      title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
      author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
      year={2025},
      eprint={2505.12224},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.12224}
}
```