Add pipeline tag and library name, include dataset, training, and evaluation information
#1
opened by nielsr (HF Staff)

README.md CHANGED

@@ -1,7 +1,9 @@
 ---
-license: apache-2.0
 base_model:
 - OpenGVLab/VideoChat2_stage3_Mistral_7B
+license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 ---
 
 <a href='https://arxiv.org/abs/2504.12083'><img src='https://img.shields.io/badge/arXiv-paper-red'></a>
@@ -48,4 +50,117 @@ python inference.py \
     --question "Describe this video." \
     --model_max_length 1024
 
-```
+```
+
+## Dataset
+
+Our training data is released as the [Self-Alignment Dataset](https://huggingface.co/datasets/pritamqu/self-alignment). It contains the preferred and non-preferred responses used in self-alignment training.
+```
+git clone git@hf.co:datasets/pritamqu/self-alignment
+```
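+
+If you prefer the Hugging Face Hub CLI to git, the same dataset repository can also be fetched as follows (a minimal sketch, assuming a recent `huggingface_hub` is installed; the local directory name is only an example):
+```
+pip install -U "huggingface_hub[cli]"
+huggingface-cli download pritamqu/self-alignment --repo-type dataset --local-dir self-alignment
+```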
+The related videos can be downloaded from their original sources; please check the [VideoChat-IT](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) GitHub page for details on downloading the source videos.
+
+We also share additional details on how to use your own data [here](docs/DATA.md).
+
+## Training
+
+Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as follows:
+
+**VideoChat2**
+```
+bash scripts/videochat2/run.sh
+```
+**LLaVA-Video**
+```
+bash scripts/llavavideo/run.sh
+```
+**LongVU**
+```
+bash scripts/longvu/run.sh
+```
+The links to the base model weights are:
+- [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B)
+- [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2)
+- [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B)
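+
+As an example, any of these checkpoints can be fetched from the Hub before launching training (a minimal sketch using `huggingface-cli`; the local directory is arbitrary and the other repo ids can be substituted):
+```
+huggingface-cli download OpenGVLab/VideoChat2_stage3_Mistral_7B --local-dir checkpoints/VideoChat2_stage3_Mistral_7B
+```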
+
+
+## Inference
+
+We provide a simple setup for running inference with our trained models.
+
+**VideoChat2**
+```
+bash scripts/inference_videochat2.sh
+```
+
+**LLaVA-Video**
+```
+bash scripts/inference_llavavideo.sh
+```
+
+**LongVU**
+```
+bash scripts/inference_longvu.sh
+```
+
+## Results
+
+**RRPO shows consistent improvements over the base models and outperforms DPO across all benchmarks.**
+
+| **Models** | **#Frames** | **TVBench** | **TempCompass** | **VideoHallucer** | **VidHalluc** | **MVBench** | **VideoMME** | **MLVU** | **LongVideoBench** |
+|------------|-------------|-------------|-----------------|-------------------|---------------|-------------|--------------|----------|--------------------|
+| VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | **60.2** | 41.0 | 46.4 | 40.4 |
+| VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
+| VideoChat2 + **RRPO** | 16 | **45.8** | **60.2** | **32.9** | **76.4** | 59.0 | **44.3** | **47.9** | **42.8** |
+| | | | | | | | | | |
+| LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
+| LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
+| LLaVA-Video + **RRPO** | 64 | 51.9 | 66.8 | 55.7 | 76.5 | **62.2** | **64.5** | 69.1 | **60.4** |
+| LLaVA-Video + **RRPO** (32f) | 64 | **52.2** | **67.4** | **55.8** | **76.6** | 62.1 | **64.5** | **69.4** | 60.1 |
+| | | | | | | | | | |
+| LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
+| LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
+| LongVU + **RRPO** | 1fps | **56.5** | **64.5** | **44.0** | **71.7** | **66.8** | **57.7** | **64.5** | **49.7** |
+
+
+## Evaluation
+
+You can download the evaluation benchmarks from the following links:
+
+- [TVBench](https://huggingface.co/datasets/FunAILab/TVBench)
+- [TempCompass](https://huggingface.co/datasets/lmms-lab/TempCompass)
+- [VideoHallucer](https://huggingface.co/datasets/bigai-nlco/VideoHallucer)
+- [VidHalluc](https://huggingface.co/datasets/chaoyuli/VidHalluc)
+- [MVBench](https://huggingface.co/datasets/PKU-Alignment/MVBench)
+- [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME)
+- [MLVU](https://huggingface.co/datasets/MLVU/MVLU)
+- [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench)
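+
+As an illustration, the Hub-hosted benchmarks above can be fetched in bulk in the same way as the training data (a sketch only: the repo ids are taken from the links above, the local paths are arbitrary, and some benchmarks may additionally require videos from their original sources):
+```
+for repo in FunAILab/TVBench lmms-lab/TempCompass bigai-nlco/VideoHallucer chaoyuli/VidHalluc \
+            PKU-Alignment/MVBench lmms-lab/Video-MME MLVU/MVLU longvideobench/LongVideoBench; do
+    huggingface-cli download "$repo" --repo-type dataset --local-dir "benchmarks/$(basename "$repo")"
+done
+```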
+
+Next, you can run the full set of evaluations by following the instructions provided [here](./docs/EVALUATION.md).
+
+
+## Citation
+
+If you find this work useful, please consider citing our paper:
+
+```
+@article{sarkar2025rrpo,
+  title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
+  author={Sarkar, Pritam and others},
+  journal={arXiv preprint arXiv:2504.12083},
+  year={2025}
+}
+```
+
+## Usage and License Notices
+
+This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses.
+The assets used in this work include, but are not limited to:
+[VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT),
+[VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B),
+[LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2),
+[LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations.
+This repository is released under the **Apache 2.0 License**. See [LICENSE](LICENSE) for details.
+
+---
+For any issues or questions, please open an issue or contact **Pritam Sarkar** at [email protected]!