---
datasets:
- MINT-SJTU/RoboFAC-dataset
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# Model Card for RoboFAC-7B
[![Project Page](https://img.shields.io/badge/Project-Page-blue)]() [![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2505.12224) [![Dataset](https://img.shields.io/badge/Dataset-Huggingface-green)](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) [![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/MINT-SJTU/RoboFAC-7B)
RoboFAC-7B is a large-scale vision-language model finetuned for **robotic failure understanding and correction**. It takes visual observations of robot executions (typically video frames) as input and produces detailed answers to questions that analyze, diagnose, and propose corrections for robotic manipulation failures.

## Model Details

### Model Description

* **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/)
* **Model type:** Vision-Language Model (VLM) for robotic failure analysis
* **Languages:** English (instruction-tuned for robotic QA)
* **License:** Apache 2.0
* **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct

---

## Uses

### Direct Use

The model is intended to serve as an *external critic* in robotic systems, to:

* Perform **task understanding** by answering what the robot is doing.
* Conduct **failure diagnosis** by identifying where and why an execution failed.
* Generate **correction suggestions** based on visual observations (example question phrasings follow this list).

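The card does not fix exact question wording; as a purely illustrative sketch, the strings below show one way to phrase each of the three use cases. Any of them can be passed as the question in the Quickstart example further down.

```python
# Illustrative only: the exact prompt wording is not prescribed by this model card.
example_questions = {
    "task_understanding": "What task is the robot trying to accomplish in this clip?",
    "failure_diagnosis": "At which step did the execution go wrong, and what caused the failure?",
    "correction": "How should the robot adjust its actions to complete the task successfully?",
}
```
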
### Downstream Use

The model can be integrated into:

* Vision-language control pipelines (e.g., VLA systems)
* Robotic operation monitoring tools (a minimal monitoring loop is sketched below)
* Training agents with self-improvement capabilities
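
As a minimal, hypothetical sketch of the monitoring use case (not part of the RoboFAC codebase): it assumes the `model` and `processor` objects loaded as in the Quickstart below, plus a placeholder `capture_recent_frames()` helper that returns the latest camera frames as PIL images.

```python
import time

def critique_execution(model, processor, frames, question="Did the robot fail, and if so, why?"):
    """Ask RoboFAC-7B about a list of PIL image frames and return only the generated answer."""
    conversation = [{
        "role": "user",
        "content": [{"type": "image"} for _ in frames] + [{"type": "text", "text": question}],
    }]
    prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=frames, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    answer_ids = outputs[:, inputs["input_ids"].shape[1]:]  # drop the echoed prompt tokens
    return processor.batch_decode(answer_ids, skip_special_tokens=True)[0]

# Hypothetical monitoring loop: periodically query the critic about recent execution frames.
while True:
    frames = capture_recent_frames()  # placeholder for the host system's camera interface
    report = critique_execution(model, processor, frames)
    print(report)  # or log it / feed it back to the controller
    time.sleep(5.0)
```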

---

## Quickstart

```python
from transformers import AutoProcessor, AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("MINT-SJTU/RoboFAC-7B", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage with a list of image frames (PIL images) and a question.
frames = [...]  # frames sampled from the robot execution video
conversation = [{
    "role": "user",
    "content": [{"type": "image"} for _ in frames] + [{"type": "text", "text": "Why did the robot fail?"}],
}]
prompt = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=frames, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
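
The `[...]` placeholder above stands for a list of frame images. One possible way to build it (an assumption, not part of this repository) is to uniformly subsample a recorded execution video with OpenCV; the file name and frame count below are arbitrary.

```python
import cv2
from PIL import Image

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file as PIL images."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i * step))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("robot_execution.mp4")  # hypothetical file name
```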

## Citation

**BibTeX:**

```bibtex
@misc{lu2025robofaccomprehensiveframeworkrobotic,
  title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
  author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
  year={2025},
  eprint={2505.12224},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.12224}
}
```