---
license: apache-2.0
datasets:
- TIGER-Lab/VideoEval
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: visual-question-answering
---

# ![MantisScore_logo](https://tiger-ai-lab.github.io/MantisScore/static/images/logo3.png) MantisScore

[Paper] | [Website](https://tiger-ai-lab.github.io/MantisScore/) | [Github](https://github.com/TIGER-AI-Lab/MantisScore) | [Datasets](https://huggingface.co/datasets/TIGER-Lab/VideoEval) | [Model](https://huggingface.co/TIGER-Lab/MantisScore) | [Demo](https://huggingface.co/spaces/Mantis-VL/MantisScore)

![MantisScore](https://tiger-ai-lab.github.io/MantisScore/static/images/teaser.png)

## Introduction

- MantisScore is a video quality evaluation model. It uses [Mantis-8B-Idefics2](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) as its base model and is trained on [VideoEval](https://huggingface.co/datasets/TIGER-Lab/VideoEval), a large video evaluation dataset with multi-aspect human scores.
- MantisScore reaches a 75+ Spearman correlation (scaled by 100) with human ratings on VideoEval-test, surpassing all MLLM-prompting methods and feature-based metrics.
- MantisScore also beats the best baselines on three other benchmarks (EvalCrafter, GenAI-Bench, and VBench), showing strong alignment with human evaluations.

## Performance

### Evaluation Results on 4 benchmarks

We test our video evaluation model MantisScore on VideoEval-test, EvalCrafter, GenAI-Bench, and VBench. For the first two benchmarks, we report the Spearman correlation between the model's output and human ratings, averaged over all evaluation aspects. For GenAI-Bench and VBench, which contain human preference data over two or more videos, we use the model's output to predict preferences and report pairwise accuracy as the performance indicator.
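The two indicators above can be sketched in a few lines of pure Python for illustration (these helpers are not the official evaluation code: the `spearman` function below ignores tied ranks, and the pairwise data layout is a simplifying assumption):

```python
# Illustrative implementations of the two indicators described above:
# Spearman correlation (no tie handling) and pairwise preference accuracy.
from typing import List, Tuple

def _rank(xs: List[float]) -> List[float]:
    """1-based ranks of xs (ties broken by position, not averaged)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0
    return ranks

def spearman(model_scores: List[float], human_scores: List[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = _rank(model_scores), _rank(human_scores)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((a - ma) * (b - mb) for a, b in zip(ra, rb))
    var_a = sum((a - ma) ** 2 for a in ra)
    var_b = sum((b - mb) ** 2 for b in rb)
    return cov / (var_a * var_b) ** 0.5

def pairwise_accuracy(pairs: List[Tuple[float, float, str]]) -> float:
    """pairs: (score_a, score_b, human_preference), preference in {"a", "b"}.

    The model "prefers" whichever video it scored higher; accuracy is the
    fraction of pairs where that prediction matches the human preference.
    """
    correct = sum(
        1 for sa, sb, pref in pairs if ("a" if sa > sb else "b") == pref
    )
    return correct / len(pairs)
```

In practice a library routine such as `scipy.stats.spearmanr` (which does average tied ranks) would be used instead of the hand-rolled helper.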
| Metric           | Final Sum Score | VideoEval-test | EvalCrafter | GenAI-Bench | VBench |
|------------------|----------------:|---------------:|------------:|------------:|-------:|
| MantisScore      |                 |                |             |             |        |
| Gemini-1.5-Pro   | 158.8           | 22.1           | 22.9        | 60.9        | 52.9   |
| Gemini-1.5-Flash | 157.5           | 20.8           | 17.3        | 67.1        | 52.3   |
| GPT-4o           | 155.4           | 23.1           | 28.7        | 52.0        | 51.7   |
| CLIP-sim         | 126.8           | 8.9            | 36.2        | 34.2        | 47.4   |
| DINO-sim         | 121.3           | 7.5            | 32.1        | 38.5        | 43.3   |
| SSIM-sim         | 118.0           | 13.4           | 26.9        | 34.1        | 43.5   |
| CLIP-Score       | 114.4           | -7.2           | 21.7        | 45.0        | 54.9   |
| LLaVA-1.5-7B     | 108.3           | 8.5            | 10.5        | 49.9        | 39.4   |
| LLaVA-1.6-7B     | 93.3            | -3.1           | 13.2        | 44.5        | 38.7   |
| X-CLIP-Score     | 92.9            | -1.9           | 13.3        | 41.4        | 40.1   |
| PIQE             | 78.3            | -10.1          | -1.2        | 34.5        | 55.1   |
| BRISQUE          | 75.9            | -20.3          | 3.9         | 38.5        | 53.7   |
| SSIM-dyn         | 42.5            | -5.5           | -17.0       | 28.4        | 36.5   |
| MES-dyn          | 36.7            | -12.9          | -26.4       | 31.4        | 44.5   |

## Usage

### Installation

```bash
pip install git+https://github.com/TIGER-AI-Lab/MantisScore.git
```

### Inference

### Training

MantisScore is trained on [VideoEval](https://huggingface.co/datasets/TIGER-Lab/VideoEval).

### Evaluation

## Citation
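For the Inference section above, the following is a minimal sketch. The prompt wording, the 8-frame sampling, the aspect names, and the use of the stock `transformers` Idefics2 classes are all assumptions for illustration; the official inference script lives in the MantisScore GitHub repo.

```python
# Hedged inference sketch for MantisScore. The prompt template, aspect list,
# and generation settings below are illustrative assumptions, not the
# official pipeline -- consult the MantisScore GitHub repo for the real one.
from typing import List

# Assumed evaluation aspects (hypothetical list for this sketch).
ASPECTS = [
    "visual quality",
    "temporal consistency",
    "dynamic degree",
    "text-to-video alignment",
    "factual consistency",
]

def uniform_frame_indices(total_frames: int, num_frames: int = 8) -> List[int]:
    """Evenly spaced frame indices, so long videos fit the context window."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

def build_prompt(text_prompt: str) -> str:
    """Assemble an evaluation query asking for one score per aspect."""
    aspect_list = ", ".join(ASPECTS)
    return (
        f"Watch the following video frames and rate them on these aspects: "
        f"{aspect_list}. The video was generated from the prompt: "
        f"'{text_prompt}'. Give a score for each aspect."
    )

def score_video(frames, text_prompt: str) -> str:
    """Run MantisScore on a list of PIL frames (requires a GPU and assumes
    the checkpoint loads with the standard Idefics2 classes)."""
    import torch
    from transformers import AutoProcessor, Idefics2ForConditionalGeneration

    processor = AutoProcessor.from_pretrained("TIGER-Lab/MantisScore")
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "TIGER-Lab/MantisScore", torch_dtype=torch.bfloat16, device_map="auto"
    )
    query = "<image>" * len(frames) + "\n" + build_prompt(text_prompt)
    inputs = processor(text=query, images=frames, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return processor.decode(out[0], skip_special_tokens=True)
```

Frames would typically be decoded from the video with a reader such as `decord` or `opencv-python` before calling `score_video`.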