---
datasets:
- MINT-SJTU/RoboFAC-dataset
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---
# Model Card for RoboFAC-7B
[![Project Page](https://img.shields.io/badge/Project-Page-blue)](https://mint-sjtu.github.io/RoboFAC.io/) [![Paper](https://img.shields.io/badge/Paper-PDF-red)](https://arxiv.org/abs/2505.12224) [![Dataset](https://img.shields.io/badge/Dataset-Huggingface-green)](https://huggingface.co/datasets/MINT-SJTU/RoboFAC-dataset) [![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/MINT-SJTU/RoboFAC-7B)
RoboFAC-7B is a large-scale vision-language model finetuned for **robotic failure understanding and correction**. Given visual observations of robot executions (typically video frames), it produces detailed answers to questions that analyze, diagnose, and propose corrections for robotic manipulation failures.
## Model Details
### Model Description
* **Developed by:** [MINT Lab, Shanghai Jiao Tong University](https://mint-sjtu.github.io/)
* **Model type:** Vision-Language Model (VLM) for robotic failure analysis
* **Languages:** English (instruction-tuned for robotic QA)
* **License:** Apache 2.0
* **Finetuned from model:** Qwen/Qwen2.5-VL-7B-Instruct
---
## Uses
### Direct Use
The model is intended to be used in robotic systems as an *external critic* (illustrative prompts for each capability are sketched after this list), to:
* Perform **task understanding** by answering what the robot is doing.
* Conduct **failure diagnosis** by identifying where and why it failed.
* Generate **correction suggestions** based on visual observations.
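The following prompts illustrate the three capabilities above. They are hypothetical examples, not drawn from the RoboFAC dataset or paper; a minimal sketch:

```python
# Hypothetical prompts for the three analysis levels; the exact question styles
# used in the RoboFAC dataset may differ.
example_questions = {
    "task_understanding": "What task is the robot performing in this video?",
    "failure_diagnosis": "At which step did the execution fail, and what caused the failure?",
    "correction_suggestion": "How should the robot adjust its actions to complete the task?",
}
```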
### Downstream Use
The model can be integrated into:
* Vision-language control pipelines (e.g., VLA systems)
* Robotic operation monitoring tools (a minimal monitoring-loop sketch follows this list)
* Training agents with self-improvement capabilities
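As a rough illustration of the monitoring use case, the sketch below wraps the model behind a hypothetical `critique(frames, question)` helper (e.g. built from the Quickstart code below) and a hypothetical `robot` interface; neither is provided by this repository.

```python
def monitor_episode(robot, critique, question="Why did the robot fail, and how can it be corrected?"):
    """Run one manipulation episode and query RoboFAC-7B as an external critic on failure.

    Assumed (hypothetical) interfaces:
      - robot.run_episode() returns (success: bool, frames: list of images)
      - critique(frames, question) wraps the RoboFAC-7B call from the Quickstart below
    """
    success, frames = robot.run_episode()
    if success:
        return None
    # On failure, ask the critic for a diagnosis and a correction suggestion.
    return critique(frames, question)
```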
---
## Quickstart
```python
from transformers import AutoProcessor, AutoModelForVision2Seq

# device_map="auto" places the weights on GPU when one is available.
model = AutoModelForVision2Seq.from_pretrained(
    "MINT-SJTU/RoboFAC-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Example usage with image frames and a question
inputs = processor(images=[...], text="Why did the robot fail?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(outputs, skip_special_tokens=True))
```
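Because RoboFAC-7B is finetuned from Qwen/Qwen2.5-VL-7B-Instruct, the standard Qwen2.5-VL chat-template flow for multi-frame (video) inputs should also work. A minimal sketch, assuming the `qwen_vl_utils` helper package from the Qwen2.5-VL examples is installed and using hypothetical local frame paths:

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from qwen_vl_utils import process_vision_info  # helper from the Qwen2.5-VL examples

model = AutoModelForVision2Seq.from_pretrained(
    "MINT-SJTU/RoboFAC-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("MINT-SJTU/RoboFAC-7B")

# Hypothetical frame paths sampled from a recorded robot execution.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": ["file:///tmp/frames/frame_00.jpg", "file:///tmp/frames/frame_01.jpg"]},
        {"type": "text", "text": "Why did the robot fail, and how should it correct the error?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding so only the generated answer is printed.
answer_ids = outputs[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```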
## Citation
**BibTeX:**
```bibtex
@misc{lu2025robofaccomprehensiveframeworkrobotic,
      title={RoboFAC: A Comprehensive Framework for Robotic Failure Analysis and Correction},
      author={Weifeng Lu and Minghao Ye and Zewei Ye and Ruihan Tao and Shuo Yang and Bo Zhao},
      year={2025},
      eprint={2505.12224},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2505.12224}
}
```