|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
--- |
|
<p align="center"> |
|
<img src="./asset/logo.png" width="80%"/> |
|
</p> |
|
|
|
# 🔥 Updates |
|
|
|
* \[3/2025\] **VMBench** evaluation code & prompt set released!
|
|
|
|
|
# 📣 Overview |
|
|
|
<p align="center"> |
|
<img src="./asset/overview.png" width="100%"/> |
|
</p> |
|
|
|
|
|
Video generation has advanced rapidly, and evaluation methods have improved alongside it, yet assessing the motion in generated videos remains a major challenge. Specifically, there are two key issues: (1) current motion metrics do not fully align with human perception; (2) existing motion prompts cover only a limited range of motions. To address these issues, we introduce **VMBench**, a comprehensive **V**ideo **M**otion **Bench**mark with perception-aligned motion metrics and the most diverse coverage of motion types to date. VMBench has several appealing properties:

1. **Perception-Driven Motion Evaluation Metrics**: we identify five dimensions of human perception in motion video assessment and develop fine-grained evaluation metrics for each, providing deeper insight into models' strengths and weaknesses in motion quality.
2. **Meta-Guided Motion Prompt Generation**: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions.
3. **Human-Aligned Validation Mechanism**: we provide human preference annotations to validate our benchmark, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods.

This is the first time that the motion quality of generated videos has been evaluated from the perspective of alignment with human perception.
|
|
|
# 📊 Evaluation Results
|
|
|
|
|
## Quantitative Results |
|
|
|
<p align="center"> |
|
<img src="./asset/eval_result.png" width="80%"/> |
|
</p> |
|
|
|
### VMBench Leaderboard |
|
|
|
<div align="center"> |
|
|
|
| Models | Avg | CAS | MSS | OIS | PAS | TCS | |
|
| -------------------- | -------- | -------- | -------- | -------- | -------- | -------- | |
|
| OpenSora-v1.2 | 51.6 | 31.2 | 61.9 | 73.0 | 3.4 | 88.5 | |
|
| Mochi 1 | 53.2 | 37.7 | 62.0 | 68.6 | 14.4 | 83.6 | |
|
| OpenSora-Plan-v1.3.0 | 58.9 | 39.3 | 76.0 | **78.6** | 6.0 | 94.7 | |
|
| CogVideoX-5B | 60.6 | 50.6 | 61.6 | 75.4 | 24.6 | 91.0 | |
|
| HunyuanVideo | 63.4 | 51.9 | 81.6 | 65.8 | **26.1** | 96.3 | |
|
| Wan2.1 | **78.4** | **62.8** | **84.2** | 66.0 | 17.9 | **97.8** | |
|
|
|
</div> |
|
|
|
# 🔨 Installation |
|
|
|
## Create Environment |
|
|
|
```shell |
|
git clone https://github.com/Ran0618/VMBench.git |
|
cd VMBench |
|
|
|
# Create and activate the conda environment
conda create -n VMBench python=3.10 -y
conda activate VMBench
pip install torch torchvision

# Install the Grounded-Segment-Anything module
cd Grounded-Segment-Anything
python -m pip install -e segment_anything
pip install --no-build-isolation -e GroundingDINO
pip install -r requirements.txt
cd ..

# Install the Grounded-SAM-2 module
cd Grounded-SAM-2
pip install -e .
cd ..

# Install the MMPose toolkit
pip install -U openmim
mim install mmengine
mim install "mmcv==2.1.0"

# Install the Q-Align module
cd Q-Align
pip install -e .
cd ..

# Install the VideoMAEv2 module
cd VideoMAEv2
pip install -r requirements.txt
cd ..
|
``` |
|
|
|
## Download Checkpoints
|
Place the pre-trained checkpoint files in the `.cache` directory. |
|
You can download the checkpoints from our [HuggingFace repository 🤗](https://huggingface.co/GD-ML/VMBench).
|
|
|
```shell |
|
mkdir -p .cache

huggingface-cli download GD-ML/VMBench --local-dir .cache/
|
``` |
|
Please organize the pretrained models in this structure: |
|
```shell |
|
VMBench/.cache |
|
├── groundingdino_swinb_cogcoor.pth |
|
├── sam2.1_hiera_large.pt |
|
├── sam_vit_h_4b8939.pth |
|
├── scaled_offline.pth |
|
└── vit_g_vmbench.pt |
|
``` |
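A quick sanity check that every checkpoint listed above is in place (a convenience sketch, not part of the VMBench tooling; the file names are taken from the structure shown):

```python
from pathlib import Path

# Verify that all expected checkpoint files exist under .cache/.
expected = [
    "groundingdino_swinb_cogcoor.pth",
    "sam2.1_hiera_large.pt",
    "sam_vit_h_4b8939.pth",
    "scaled_offline.pth",
    "vit_g_vmbench.pt",
]
missing = [name for name in expected if not (Path(".cache") / name).is_file()]
print("All checkpoints found." if not missing else f"Missing checkpoints: {missing}")
```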
|
|
|
# 🔧 Usage
|
|
|
## Video Preparation
|
|
|
Generate videos with your model using the 1,050 prompts provided in `prompts/prompts.txt` or `prompts/prompts.json`, and organize them in the following structure:
|
|
|
```shell |
|
VMBench/eval_results/videos |
|
├── 0001.mp4 |
|
├── 0002.mp4 |
|
... |
|
└── 1050.mp4 |
|
``` |
|
|
|
**Note:** Ensure that you maintain the correspondence between prompts and video sequence numbers. The index for each prompt can be found in the `prompts/prompts.json` file. |
|
|
|
You can follow `sample_video_demo.py` to generate videos, or place your own generated videos, named by prompt index, into a folder of your choice.
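If you write your own sampling loop instead of `sample_video_demo.py`, the key requirement is that each output file is named with the zero-padded prompt index. Below is a minimal sketch of that mapping; the `index` and `prompt` field names are assumptions about the `prompts/prompts.json` schema, so adjust them to the actual file.

```python
import json

# Hedged sketch: print the prompt-to-filename mapping that VMBench expects.
# The "index" and "prompt" keys are assumptions about prompts.json;
# inspect the file and rename them if the schema differs.
with open("prompts/prompts.json", "r", encoding="utf-8") as f:
    prompts = json.load(f)

for item in prompts:
    filename = f"{int(item['index']):04d}.mp4"  # zero-padded: 0001.mp4 ... 1050.mp4
    print(filename, "<-", item["prompt"][:60])  # feed item["prompt"] to your model
```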
|
|
|
|
|
## Evaluation on VMBench
|
|
|
### Running the Evaluation Pipeline |
|
To evaluate generated videos with VMBench, run the following command:
|
|
|
```shell |
|
bash evaluate.sh your_videos_folder |
|
``` |
|
|
|
The evaluation results for each video will be saved to `./eval_results/${current_time}/results.json`, and the score for each dimension will be saved to `./eval_results/${current_time}/scores.csv`.
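The exact schema of these files is defined by the evaluation scripts; as a rough sketch, assuming `scores.csv` holds a header row with the dimension names and a single row of scores, the results could be summarized like this (the timestamped folder name below is hypothetical):

```python
import csv

# Hedged sketch: print each dimension's score and their mean from scores.csv.
# The column layout is an assumption; adapt the path and the parsing to the
# actual files produced by evaluate.sh.
with open("eval_results/2025-03-15_12-00-00/scores.csv", newline="") as f:  # hypothetical run folder
    row = next(csv.DictReader(f))

scores = {dim: float(val) for dim, val in row.items()}
for dim, val in scores.items():
    print(f"{dim}: {val:.1f}")
print(f"Average: {sum(scores.values()) / len(scores):.1f}")
```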
|
|
|
### Evaluation Efficiency |
|
|
|
We conducted a test using the following configuration: |
|
|
|
- **Model**: CogVideoX-5B |
|
- **Number of Videos**: 1,050 |
|
- **Frames per Video**: 49 |
|
- **Frame Rate**: 8 FPS |
|
|
|
Here are the time measurements for each evaluation metric: |
|
|
|
| Metric | Time Taken | |
|
|--------|------------| |
|
| PAS (Perceptible Amplitude Score) | 45 minutes | |
|
| OIS (Object Integrity Score) | 30 minutes | |
|
| TCS (Temporal Coherence Score) | 2 hours | |
|
| MSS (Motion Smoothness Score) | 2.5 hours | |
|
| CAS (Commonsense Adherence Score) | 1 hour | |
|
|
|
**Total Evaluation Time**: 6 hours and 45 minutes |
|
|
|
# ❤️ Acknowledgement
|
We would like to express our gratitude to the following open-source repositories that our work is based on: [GroundedSAM](https://github.com/IDEA-Research/Grounded-Segment-Anything), [GroundedSAM2](https://github.com/IDEA-Research/Grounded-SAM-2), [Co-Tracker](https://github.com/facebookresearch/co-tracker), [MMPose](https://github.com/open-mmlab/mmpose), [Q-Align](https://github.com/Q-Future/Q-Align), [VideoMAEv2](https://github.com/OpenGVLab/VideoMAEv2), [VideoAlign](https://github.com/KwaiVGI/VideoAlign). |
|
Their contributions have been invaluable to this project. |
|
|
|
# 📜 License
|
VMBench is licensed under the [Apache-2.0 license](http://www.apache.org/licenses/LICENSE-2.0). You are free to use our code for research purposes.
|
|
|
# ✏️ Citation
|
If you find our repo useful for your research, please consider citing our paper: |
|
```bibtex |
|
@misc{ling2025vmbenchbenchmarkperceptionalignedvideo, |
|
title={VMBench: A Benchmark for Perception-Aligned Video Motion Generation}, |
|
author={Xinran Ling and Chen Zhu and Meiqi Wu and Hangyu Li and Xiaokun Feng and Cundian Yang and Aiming Hao and Jiashu Zhu and Jiahong Wu and Xiangxiang Chu}, |
|
year={2025}, |
|
eprint={2503.10076}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2503.10076}, |
|
} |
|
``` |