nielsr (HF Staff) committed
Commit 7c2d0d7 · verified · 1 Parent(s): bfcbb5a

Add pipeline tag and library name, include dataset, training, and evaluation information

This PR adds the `pipeline_tag` and `library_name` fields to the model card metadata. The `pipeline_tag` is set to `video-text-to-text` based on the model's functionality as described in the paper and the provided code examples, and the `library_name` is set to `transformers` given its usage in the provided code snippets. The PR also documents the training dataset, training instructions, results, and evaluation instructions.

Files changed (1)
  1. README.md +117 -2
README.md CHANGED
@@ -1,7 +1,9 @@
---
- license: apache-2.0
base_model:
- OpenGVLab/VideoChat2_stage3_Mistral_7B
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
+ library_name: transformers
---

<a href='https://arxiv.org/abs/2504.12083'><img src='https://img.shields.io/badge/arXiv-paper-red'></a>
@@ -48,4 +50,117 @@ python inference.py \
  --question "Describe this video." \
  --model_max_length 1024

- ```
+ ```
+
+ ## Dataset
+
+ Our training data is released as the [Self-Alignment Dataset](https://huggingface.co/datasets/pritamqu/self-alignment). It contains the preferred and non-preferred responses used in self-alignment training.
+ ```
+ git clone git@hf.co:datasets/pritamqu/self-alignment
+ ```
+ The related videos can be downloaded from their original sources. Please check the [VideoChat-IT](https://github.com/OpenGVLab/Ask-Anything/blob/main/video_chat2/DATA.md) GitHub page for details on downloading the source videos.
+
+ We also share additional details on how to use your own data [here](docs/DATA.md).
+
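+ If SSH access to the Hub is not configured, the same files can be fetched over HTTPS. A minimal sketch using `huggingface-cli` (assuming a recent `huggingface_hub` install; the destination directory is illustrative):
+ ```
+ # Alternative download over HTTPS (no SSH key required)
+ # Requires: pip install -U huggingface_hub
+ huggingface-cli download pritamqu/self-alignment \
+     --repo-type dataset \
+     --local-dir ./self-alignment   # illustrative destination path
+ ```
+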
+ ## Training
+
+ Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as follows:
+
+ VideoChat2
+ ```
+ bash scripts/videochat2/run.sh
+ ```
+ LLaVA-Video
+ ```
+ bash scripts/llavavideo/run.sh
+ ```
+ LongVU
+ ```
+ bash scripts/longvu/run.sh
+ ```
+ The links to the base model weights are listed below (a sample download command follows the list):
+ - [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B)
+ - [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2)
+ - [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B)
+
+
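+ As a reference, a minimal sketch for fetching one of the base checkpoints with `huggingface-cli` (the destination path is illustrative; place the weights wherever the corresponding training script expects them):
+ ```
+ # Fetch a base checkpoint before launching training
+ # Requires: pip install -U huggingface_hub
+ huggingface-cli download OpenGVLab/VideoChat2_stage3_Mistral_7B \
+     --local-dir ./weights/VideoChat2_stage3_Mistral_7B   # illustrative path
+ ```
+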
+ ## Inference
+
+ We provide a simple setup for running inference with our trained models.
+
+ **VideoChat2**
+ ```
+ bash scripts/inference_videochat2.sh
+ ```
+
+ **LLaVA-Video**
+ ```
+ bash scripts/inference_llavavideo.sh
+ ```
+
+ **LongVU**
+ ```
+ bash scripts/inference_longvu.sh
+ ```
+
+ ## Results
+
+ **RRPO shows consistent improvements over the base model and outperforms DPO across all benchmarks.**
+
+ | **Models** | **#Frames** | **TVBench** | **TempCompass** | **VideoHallucer** | **VidHalluc** | **MVBench** | **VideoMME** | **MLVU** | **LongVideoBench** |
+ |------------|------|-------------|----------------|----------------|-------------|-------------|-------------|--------|------------------|
+ | VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | **60.2** | 41.0 | 46.4 | 40.4 |
+ | VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
+ | VideoChat2 + **RRPO** | 16 | **45.8** | **60.2** | **32.9** | **76.4** | 59.0 | **44.3** | **47.9** | **42.8** |
+ | | | | | | | | | | |
+ | LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
+ | LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
+ | LLaVA-Video + **RRPO** | 64 | 51.9 | 66.8 | 55.7 | 76.5 | **62.2** | **64.5** | 69.1 | **60.4** |
+ | LLaVA-Video + **RRPO** (32f) | 64 | **52.2** | **67.4** | **55.8** | **76.6** | 62.1 | **64.5** | **69.4** | 60.1 |
+ | | | | | | | | | | |
+ | LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
+ | LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
+ | LongVU + **RRPO** | 1fps | **56.5** | **64.5** | **44.0** | **71.7** | **66.8** | **57.7** | **64.5** | **49.7** |
+
+
+ ## Evaluation
+
+ You can download the evaluation benchmarks from the links below (a sample download command follows the list):
+
+ - [TVBench](https://huggingface.co/datasets/FunAILab/TVBench)
+ - [TempCompass](https://huggingface.co/datasets/lmms-lab/TempCompass)
+ - [VideoHallucer](https://huggingface.co/datasets/bigai-nlco/VideoHallucer)
+ - [VidHalluc](https://huggingface.co/datasets/chaoyuli/VidHalluc)
+ - [MVBench](https://huggingface.co/datasets/PKU-Alignment/MVBench)
+ - [VideoMME](https://huggingface.co/datasets/lmms-lab/Video-MME)
+ - [MLVU](https://huggingface.co/datasets/MLVU/MVLU)
+ - [LongVideoBench](https://huggingface.co/datasets/longvideobench/LongVideoBench)
+
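+ For benchmarks hosted on the Hugging Face Hub, a minimal sketch with `huggingface-cli` (the destination directory is illustrative; some benchmarks may require accepting their terms on the Hub first):
+ ```
+ # Example: download one benchmark from the Hub
+ # Requires: pip install -U huggingface_hub
+ huggingface-cli download FunAILab/TVBench \
+     --repo-type dataset \
+     --local-dir ./benchmarks/TVBench   # illustrative destination path
+ ```
+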
+ Next, you can run the full evaluation by following the instructions provided [here](./docs/EVALUATION.md).
+
+
+ ## Citation
+
+ If you find this work useful, please consider citing our paper:
+
+ ```
+ @article{sarkar2025rrpo,
+   title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
+   author={Sarkar, Pritam and others},
+   journal={arXiv preprint arXiv:2504.12083},
+   year={2025}
+ }
+ ```
+
+ ## Usage and License Notices
+
+ This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses.
+ The assets used in this work include, but are not limited to:
+ [VideoChat2-IT](https://huggingface.co/datasets/OpenGVLab/VideoChat2-IT),
+ [VideoChat2_stage3_Mistral_7B](https://huggingface.co/OpenGVLab/VideoChat2_stage3_Mistral_7B),
+ [LLaVA-Video-7B-Qwen2](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2),
+ [LongVU_Qwen2_7B](https://huggingface.co/Vision-CAIR/LongVU_Qwen2_7B). This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations.
+ This repository is released under the **Apache 2.0 License**. See [LICENSE](LICENSE) for details.
+
+ ---
+ For any issues or questions, please open an issue or contact **Pritam Sarkar** at [email protected]!