---
base_model:
- OpenGVLab/VideoChat2_stage3_Mistral_7B
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers
---
<a href='https://arxiv.org/abs/2504.12083'><img src='https://img.shields.io/badge/arXiv-paper-red'></a>
<a href='https://pritamqu.github.io/RRPO/'><img src='https://img.shields.io/badge/project-RRPO-blue'></a>
<a href='https://huggingface.co/datasets/pritamqu/self-alignment'><img src='https://img.shields.io/badge/huggingface-datasets-green'></a>
<a href='https://huggingface.co/collections/pritamqu/rrpo-67fbc8c048b298a5fdfb167b'><img src='https://img.shields.io/badge/model-checkpoints-yellow'></a>
<a href='https://github.com/pritamqu/RRPO'><img src='https://img.shields.io/badge/github-repository-purple'></a>
## Installation
Clone the repository, navigate to the RRPO directory, and set up the conda environment:
```sh
git clone https://github.com/pritamqu/RRPO
cd RRPO
conda create -n videochat2 python=3.10 -y
conda activate videochat2
pip install -r videochat2.txt
```
## Download weights
```sh
# base model
git clone [email protected]:OpenGVLab/VideoChat2_stage3_Mistral_7B
# RRPO weights
git clone [email protected]:pritamqu/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA
```
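If you prefer the Hugging Face Hub Python API over `git`, the same repositories can also be fetched with `huggingface_hub` (a minimal sketch; adjust `local_dir` if you keep the weights elsewhere):
```python
# Sketch: download the base model and the RRPO LoRA weights via the Hub API.
from huggingface_hub import snapshot_download

base_dir = snapshot_download(
    repo_id="OpenGVLab/VideoChat2_stage3_Mistral_7B",
    local_dir="VideoChat2_stage3_Mistral_7B",
)
lora_dir = snapshot_download(
    repo_id="pritamqu/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA",
    local_dir="VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA",
)
print(base_dir, lora_dir)
```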
## Inference
```sh
conda activate videochat2
BASE_WEIGHTS="./VideoChat2_stage3_Mistral_7B"
WEIGHTS_ROOT="./"
python inference.py \
--base_model_name "videochat2_mistral_7b" \
--model-path ${BASE_WEIGHTS} \
--model-path2 ${WEIGHTS_ROOT}"/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA" \
--video_path "sample_video.mp4" \
--question "Describe this video." \
--model_max_length 1024
```
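To caption a whole folder of videos, one simple option is to loop over the files and invoke the same command once per video. This is only a sketch under the assumption that the videos live in a local `videos/` folder; it reloads the model on every call, so it is convenient rather than fast:
```python
# Sketch: run inference.py on every .mp4 in a folder, reusing the arguments above.
import subprocess
from pathlib import Path

BASE_WEIGHTS = "./VideoChat2_stage3_Mistral_7B"
LORA_WEIGHTS = "./VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA"

for video in sorted(Path("videos").glob("*.mp4")):  # assumed input folder
    subprocess.run(
        [
            "python", "inference.py",
            "--base_model_name", "videochat2_mistral_7b",
            "--model-path", BASE_WEIGHTS,
            "--model-path2", LORA_WEIGHTS,
            "--video_path", str(video),
            "--question", "Describe this video.",
            "--model_max_length", "1024",
        ],
        check=True,
    )
```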
## Dataset
Our training data is released as the [Self-Alignment Dataset](https://huggingface.co/datasets/pritamqu/self-alignment), which contains the preferred and non-preferred responses used in self-alignment training.
```sh
git clone [email protected]:datasets/pritamqu/self-alignment
```
The related videos can be downloaded from their original sources; please see the [VideoChat2-IT](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) data page for details on downloading the source videos.
We also share additional details on how to use your own data [here](docs/DATA.md).
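For a quick look at the preference pairs before training, the dataset can typically be loaded with the `datasets` library as well (a minimal sketch; the split and field names are not guaranteed, so check the dataset card for the actual schema):
```python
# Sketch: inspect the self-alignment preference data with the `datasets` library.
from datasets import load_dataset

ds = load_dataset("pritamqu/self-alignment")
print(ds)                      # available splits and their sizes

split = next(iter(ds))         # name of the first split
print(ds[split][0].keys())     # field names of the first example (schema may vary)
```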
## Training
Before training, make sure to prepare the data and download the weights of the base models. Then launch the training jobs as follows:
**VideoChat2**
```sh
bash scripts/videochat2/run.sh
```
**LLaVA-Video**
```sh
bash scripts/llavavideo/run.sh
```
**LongVU**
```sh
bash scripts/longvu/run.sh
```
The links to the base model weights are:
- [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B)
- [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2)
- [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B)
## Inference scripts
We provide a simple setup for running inference with our trained models.
**VideoChat2**
```sh
bash scripts/inference_videochat2.sh
```
**LLaVA-Video**
```sh
bash scripts/inference_llavavideo.sh
```
**LongVU**
```sh
bash scripts/inference_longvu.sh
```
## Results
**RRPO shows consistent improvements over the base model and outperforms DPO across all benchmarks.**
| **Models** | **#Frames** | **TVBench** | **TempCompass** | **VideoHallucer** | **VidHalluc** | **MVBench** | **VideoMME** | **MLVU** | **LongVideoBench** |
|------------|------|-------------|----------------|----------------|-------------|-------------|-------------|--------|------------------|
| VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | **60.2** | 41.0 | 46.4 | 40.4 |
| VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
| VideoChat2 + **RRPO** | 16 | **45.8** | **60.2** | **32.9** | **76.4** | 59.0 | **44.3** | **47.9** | **42.8** |
| | | | | | | | | | |
| LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
| LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
| LLaVA-Video + **RRPO** | 64 | 51.9 | 66.8 | 55.7 | 76.5 | **62.2** | **64.5** | 69.1 | **60.4** |
| LLaVA-Video + **RRPO** (32f) | 64 | **52.2** | **67.4** | **55.8** | **76.6** | 62.1 | **64.5** | **69.4** | 60.1 |
| | | | | | | | | | |
| LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
| LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
| LongVU + **RRPO** | 1fps | **56.5** | **64.5** | **44.0** | **71.7** | **66.8** | **57.7** | **64.5** | **49.7** |
## Evaluation
You can download evaluation benchmarks from the given links:
- [TVBench](https://huggingface.co/datasets/FunAILab/TVBench)
- [TempCompass](https://huggingface.co/datasets/lmms-lab/TempCompass)
- [VideoHallucer](https://huggingface.co/datasets/bigai-nlco/VideoHallucer)
- [VidHalluc](https://huggingface.co/datasets/chaoyuli/VidHalluc)
- [MVBench](https://huggingface.co/datasets/PKU-Alignment/MVBench)
- [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME)
- [MLVU](https://huggingface.co/datasets/MLVU/MVLU)
- [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench)
Next, you can run the full evaluation suite following the instructions provided [here](./docs/EVALUATION.md).
## Citation
If you find this work useful, please consider citing our paper:
```bibtex
@article{sarkar2025rrpo,
  title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
  author={Sarkar, Pritam and others},
  journal={arXiv preprint arXiv:2504.12083},
  year={2025}
}
```
## Usage and License Notices
This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses.
The assets used in this work include, but are not limited to:
[VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT),
[VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B),
[LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2),
[LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations.
This repository is released under the **Apache 2.0 License**. See [LICENSE](LICENSE) for details.
---
For any issues or questions, please open an issue or contact **Pritam Sarkar** at [email protected]!