MINT-SJTU
/

RoboFAC-7B

Safetensors

qwen2_5_vl

Model card Files Files and versions Community

Update README.md

by Idphilosea - opened Jun 3

base: refs/heads/main

←

from: refs/pr/2

Discussion Files changed

+73

-74

Files changed (1) hide show

README.md +73 -74

README.md CHANGED Viewed

@@ -1,74 +1,73 @@
-<!-- ```yaml
-datasets:
-- MINT-SJTU/RoboFAC-dataset
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
-``` -->
-# Model Card for RoboFAC-7B
-[![Project Page](https://img.shields.io/badge/Project-Page-blue)]() [![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2505.12224) [![Dataset](https://img.shields.io/badge/Dataset-Huggingface-green)](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) [![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/MINT-SJTU/RoboFAC-7B)
-RoboFAC-7B is a large-scale vision-language model specifically finetuned for **robotic failure understanding and correction**. It takes in visual observations of robot executions (usually video frames) and outputs detailed answers to questions that analyze, diagnose, and propose corrections for robotic manipulation failures.
-## Model Details
-### Model Description
-* **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/)
-* **Model type:** Vision-Language Model (VLM) for robotic failure analysis
-* **Languages:** English (instruction-tuned for robotic QA)
-* **License:** Apache 2.0
-* **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct
----
-## Uses
-### Direct Use
-The model is intended to be used in robotic systems as an *external critic*, to:
-* Perform **task understanding** by answering what the robot is doing.
-* Conduct **failure diagnosis** by identifying where and why it failed.
-* Generate **correction suggestions** based on visual observations.
-### Downstream Use
-The model can be integrated into:
-* Vision-language control pipelines (e.g., VLA systems)
-* Robotic operation monitoring tools
-* Training agents with self-improvement capabilities
----
-## Quickstart
-```python
-from transformers import AutoProcessor, AutoModelForVision2Seq
-model = AutoModelForVision2Seq.from_pretrained("MINT-SJTU/RoboFAC-7B")
-processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")
-# Example usage with image frames and a question
-inputs = processor(images=[...], text="Why did the robot fail?", return_tensors="pt").to("cuda")
-outputs = model.generate(**inputs)
-print(processor.batch_decode(outputs, skip_special_tokens=True))
-```
-## Citation
-**BibTeX:**
-```bibtex
-@misc{lu2025robofaccomprehensiveframeworkrobotic,
-  title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
-  author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
-  year={2025},
-  eprint={2505.12224},
-  archivePrefix={arXiv},
-  primaryClass={cs.RO},
-  url={https://arxiv.org/abs/2505.12224}
-}
-```

+---
+datasets:
+- MINT-SJTU/RoboFAC-dataset
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
+---
+# Model Card for RoboFAC-7B
+[![Project Page](https://img.shields.io/badge/Project-Page-blue)]() [![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2505.12224) [![Dataset](https://img.shields.io/badge/Dataset-Huggingface-green)](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) [![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/MINT-SJTU/RoboFAC-7B)
+RoboFAC-7B is a large-scale vision-language model specifically finetuned for **robotic failure understanding and correction**. It takes in visual observations of robot executions (usually video frames) and outputs detailed answers to questions that analyze, diagnose, and propose corrections for robotic manipulation failures.
+## Model Details
+### Model Description
+* **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/)
+* **Model type:** Vision-Language Model (VLM) for robotic failure analysis
+* **Languages:** English (instruction-tuned for robotic QA)
+* **License:** Apache 2.0
+* **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct
+---
+## Uses
+### Direct Use
+The model is intended to be used in robotic systems as an *external critic*, to:
+* Perform **task understanding** by answering what the robot is doing.
+* Conduct **failure diagnosis** by identifying where and why it failed.
+* Generate **correction suggestions** based on visual observations.
+### Downstream Use
+The model can be integrated into:
+* Vision-language control pipelines (e.g., VLA systems)
+* Robotic operation monitoring tools
+* Training agents with self-improvement capabilities
+---
+## Quickstart
+```python
+from transformers import AutoProcessor, AutoModelForVision2Seq
+model = AutoModelForVision2Seq.from_pretrained("MINT-SJTU/RoboFAC-7B")
+processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")
+# Example usage with image frames and a question
+inputs = processor(images=[...], text="Why did the robot fail?", return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs)
+print(processor.batch_decode(outputs, skip_special_tokens=True))
+```
+## Citation
+**BibTeX:**
+```bibtex
+@misc{lu2025robofaccomprehensiveframeworkrobotic,
+  title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
+  author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
+  year={2025},
+  eprint={2505.12224},
+  archivePrefix={arXiv},
+  primaryClass={cs.RO},
+  url={https://arxiv.org/abs/2505.12224}
+}
+```