Add pipeline tag and library name, include dataset, training, and evaluation information

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +117 -2
README.md CHANGED
@@ -1,7 +1,9 @@
  ---
- license: apache-2.0
  base_model:
  - OpenGVLab/VideoChat2_stage3_Mistral_7B
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
  ---

  <a href='https://arxiv.org/abs/2504.12083'><img src='https://img.shields.io/badge/arXiv-paper-red'></a>
@@ -48,4 +50,117 @@ python inference.py \
  --question "Describe this video." \
  --model_max_length 1024

- ```
+ ```
+
+ ## Dataset
+
+ Our training data is released as the [Self-Alignment Dataset](https://huggingface.co/datasets/pritamqu/self-alignment). It contains the preferred and non-preferred responses used in self-alignment training.
+ ```
+ git clone [email protected]:datasets/pritamqu/self-alignment
+ ```
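If you prefer to fetch the dataset without git over SSH, it can also be pulled programmatically. A minimal sketch using `huggingface_hub` (the repo ID is taken from the link above; the local directory is just an example):

```python
# Download the self-alignment preference data from the Hugging Face Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="pritamqu/self-alignment",  # dataset repo referenced above
    repo_type="dataset",
    local_dir="data/self-alignment",    # example location; adjust as needed
)
print(f"Dataset files downloaded to: {local_path}")
```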
+ The related videos can be downloaded from their original sources; please check the [VideoChat2-IT](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) GitHub page for details on downloading the source videos.
+
+ We also share additional details on how to use your own data [here](docs/DATA.md).
+
+ ## Training
+
+ Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as follows:
+
+ VideoChat2
+ ```
+ bash scripts/videochat2/run.sh
+ ```
+ LLaVA-Video
+ ```
+ bash scripts/llavavideo/run.sh
+ ```
+ LongVU
+ ```
+ bash scripts/longvu/run.sh
+ ```
+ The links to the base model weights are (a download sketch follows the list):
+ - [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B)
+ - [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2)
+ - [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B)
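If you would rather fetch these checkpoints programmatically than clone them by hand, the same `huggingface_hub` helper works for model repos. A minimal sketch using the repo IDs listed above (weights land in the local Hugging Face cache by default):

```python
# Download the base model weights listed above before launching training.
from huggingface_hub import snapshot_download

BASE_MODELS = [
    "OpenGVLab/VideoChat2_stage3_Mistral_7B",
    "lmms-lab/LLaVA-Video-7B-Qwen2",
    "Vision-CAIR/LongVU_Qwen2_7B",
]

for repo_id in BASE_MODELS:
    path = snapshot_download(repo_id=repo_id)  # downloads into the local HF cache
    print(f"{repo_id} -> {path}")
```

How each `run.sh` script locates these weights is repo-specific, so check the corresponding script before launching a job.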
+
+
+ ## Inference
+
+ We provide a simple setup for running inference with our trained models.
+
+ **VideoChat2**
+ ```
+ bash scripts/inference_videochat2.sh
+ ```
+
+ **LLaVA-Video**
+ ```
+ bash scripts/inference_llavavideo.sh
+ ```
+
+ **LongVU**
+ ```
+ bash scripts/inference_longvu.sh
+ ```
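If you want to invoke one of these wrappers from Python (for example inside a notebook or a larger evaluation pipeline), here is a minimal sketch; it assumes you run it from the repository root and that the script's own requirements (checkpoints, GPU environment) are already satisfied:

```python
# Call the VideoChat2 inference wrapper shipped with the repo and capture its output.
import subprocess

result = subprocess.run(
    ["bash", "scripts/inference_videochat2.sh"],  # script referenced above
    capture_output=True,
    text=True,
    check=True,  # raise if the script exits with a non-zero status
)
print(result.stdout)
```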
+
+ ## Results
+
+ **RRPO shows consistent improvements over the base model and outperforms DPO across all benchmarks.**
+
+ | **Models** | **#Frames** | **TVBench** | **TempCompass** | **VideoHallucer** | **VidHalluc** | **MVBench** | **VideoMME** | **MLVU** | **LongVideoBench** |
+ |------------|------|-------------|----------------|----------------|-------------|-------------|-------------|--------|------------------|
+ | VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | **60.2** | 41.0 | 46.4 | 40.4 |
+ | VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
+ | VideoChat2 + **RRPO** | 16 | **45.8** | **60.2** | **32.9** | **76.4** | 59.0 | **44.3** | **47.9** | **42.8** |
+ | | | | | | | | | | |
+ | LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
+ | LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
+ | LLaVA-Video + **RRPO** | 64 | 51.9 | 66.8 | 55.7 | 76.5 | **62.2** | **64.5** | 69.1 | **60.4** |
+ | LLaVA-Video + **RRPO** (32f) | 64 | **52.2** | **67.4** | **55.8** | **76.6** | 62.1 | **64.5** | **69.4** | 60.1 |
+ | | | | | | | | | | |
+ | LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
+ | LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
+ | LongVU + **RRPO** | 1fps | **56.5** | **64.5** | **44.0** | **71.7** | **66.8** | **57.7** | **64.5** | **49.7** |
+
+
+ ## Evaluation
+
+ You can download the evaluation benchmarks from the links below (a download sketch follows the list):
+
+ - [TVBench](https://huggingface.co/datasets/FunAILab/TVBench)
+ - [TempCompass](https://huggingface.co/datasets/lmms-lab/TempCompass)
+ - [VideoHallucer](https://huggingface.co/datasets/bigai-nlco/VideoHallucer)
+ - [VidHalluc](https://huggingface.co/datasets/chaoyuli/VidHalluc)
+ - [MVBench](https://huggingface.co/datasets/PKU-Alignment/MVBench)
+ - [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME)
+ - [MLVU](https://huggingface.co/datasets/MLVU/MVLU)
+ - [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench)
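The same repos can be fetched in one pass with `huggingface_hub`. A minimal sketch (repo IDs are copied from the links above; the output layout is an example, and some repos may be gated and require logging in first):

```python
# Fetch the evaluation benchmarks listed above from the Hugging Face Hub.
from huggingface_hub import snapshot_download

BENCHMARKS = [
    "FunAILab/TVBench",
    "lmms-lab/TempCompass",
    "bigai-nlco/VideoHallucer",
    "chaoyuli/VidHalluc",
    "PKU-Alignment/MVBench",
    "lmms-lab/Video-MME",
    "MLVU/MVLU",
    "longvideobench/LongVideoBench",
]

for repo_id in BENCHMARKS:
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=f"benchmarks/{repo_id.split('/')[-1]}",  # example layout
    )
```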
+
+ Next, you can run the full evaluation suite following the instructions provided [here](./docs/EVALUATION.md).
+
+
+ ## Citation
+
+ If you find this work useful, please consider citing our paper:
+
+ ```
+ @article{sarkar2025rrpo,
+   title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
+   author={Sarkar, Pritam and others},
+   journal={arXiv preprint arXiv:2504.12083},
+   year={2025}
+ }
+ ```
+
+ ## Usage and License Notices
+
+ This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses.
+ The assets used in this work include, but are not limited to:
+ [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT),
+ [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B),
+ [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2),
+ [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations.
+ This repository is released under the **Apache 2.0 License**. See [LICENSE](LICENSE) for details.
+
+ ---
+ For any issues or questions, please open an issue or contact **Pritam Sarkar** at [email protected]!