base_model:
  - OpenGVLab/VideoChat2_stage3_Mistral_7B
license: apache-2.0
pipeline_tag: video-text-to-text
library_name: transformers

Installation

Clone the repository and navigate to the RRPO directory:

git clone https://github.com/pritamqu/RRPO
cd RRPO

conda create -n videochat2 python=3.10 -y
conda activate videochat2
pip install -r videochat2.txt

Download weights

# base model
git clone git@hf.co:OpenGVLab/VideoChat2_stage3_Mistral_7B
# RRPO weights
git clone git@hf.co:pritamqu/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA
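
If cloning over SSH is inconvenient, the same two repositories can probably be fetched from Python instead. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper name `download_weights` and its `fetch` parameter are illustrative and not part of the RRPO repository.

```python
from pathlib import Path
from typing import Callable, List, Optional

# Repo IDs taken from the clone commands in this section.
REPOS = [
    "OpenGVLab/VideoChat2_stage3_Mistral_7B",               # base model
    "pritamqu/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA",  # RRPO weights
]


def download_weights(repo_ids: List[str], root: str = ".",
                     fetch: Optional[Callable] = None) -> List[Path]:
    """Download each repo into <root>/<repo name> and return the local paths.

    `fetch` is a hypothetical hook (not an RRPO API) that defaults to
    huggingface_hub.snapshot_download; it only exists so the helper can be
    exercised without network access.
    """
    if fetch is None:
        # Imported lazily so the module loads even without the package.
        from huggingface_hub import snapshot_download
        fetch = snapshot_download

    paths = []
    for repo_id in repo_ids:
        # Mirror the directory names that `git clone` would create.
        local_dir = Path(root) / repo_id.split("/")[-1]
        fetch(repo_id=repo_id, local_dir=str(local_dir))
        paths.append(local_dir)
    return paths
```

Calling `download_weights(REPOS)` from the RRPO directory should leave the weights in the same folders that the inference command expects.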

Inference

conda activate videochat2
BASE_WEIGHTS="./VideoChat2_stage3_Mistral_7B"
WEIGHTS_ROOT="./"

python inference.py \
    --base_model_name "videochat2_mistral_7b" \
    --model-path ${BASE_WEIGHTS} \
    --model-path2 ${WEIGHTS_ROOT}"/VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA" \
    --video_path "sample_video.mp4" \
    --question "Describe this video." \
    --model_max_length 1024
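
To call the same command from Python (for example, to loop over several videos), a thin wrapper could look like the sketch below. It only reuses the flags shown in the command above. The function names are illustrative, and reading the answer from standard output is an assumption about how `inference.py` reports its result.

```python
import shlex
import subprocess
import sys
from pathlib import Path


def build_inference_cmd(base_weights, weights_root, video_path, question,
                        model_max_length=1024):
    """Return the argv list for the inference.py call shown above."""
    lora_weights = Path(weights_root) / "VideoChat2_stage3_Mistral_7B-RRPO-16f-LORA"
    return [
        sys.executable, "inference.py",
        "--base_model_name", "videochat2_mistral_7b",
        "--model-path", str(base_weights),
        "--model-path2", str(lora_weights),
        "--video_path", str(video_path),
        "--question", question,
        "--model_max_length", str(model_max_length),
    ]


def run_inference(**kwargs):
    """Run inference.py from the RRPO repository root.

    Assumption: the script prints its answer to stdout; adjust if it
    writes to a file instead.
    """
    cmd = build_inference_cmd(**kwargs)
    print("Running:", shlex.join(cmd))
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the question contains spaces or quotes.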

Dataset

Our training data is released as the Self-Alignment Dataset, which contains the preferred and non-preferred responses used in self-alignment training.

git clone git@hf.co:datasets/pritamqu/self-alignment

The related videos can be downloaded from their original sources. See the VideoChat2-IT GitHub page for details on downloading the source videos.

We also share additional details on how to use your own data here.

Training

Before training, make sure to prepare the data and download the weights of the base models. Then you can launch the training jobs as:

VideoChat2

bash scripts/videochat2/run.sh

LLaVA-Video

bash scripts/llavavideo/run.sh

LongVU

bash scripts/longvu/run.sh

The base model weights are VideoChat2_stage3_Mistral_7B, LLaVA-Video-7B-Qwen2, and LongVU_Qwen2_7B.

Inference

We provide simple scripts for running inference with our trained models.

VideoChat2

bash scripts/inference_videochat2.sh

LLaVA-Video

bash scripts/inference_llavavideo.sh

LongVU

bash scripts/inference_longvu.sh
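
When switching between models often, the script paths listed in the Training and Inference sections can be collected in one lookup. The sketch below is purely illustrative: `SCRIPTS`, `script_for`, and `launch` are hypothetical helpers rather than part of the repository, and `launch` assumes it is run from the RRPO repository root.

```python
import subprocess

# Script paths exactly as listed in the Training and Inference sections.
SCRIPTS = {
    "train": {
        "videochat2": "scripts/videochat2/run.sh",
        "llavavideo": "scripts/llavavideo/run.sh",
        "longvu": "scripts/longvu/run.sh",
    },
    "inference": {
        "videochat2": "scripts/inference_videochat2.sh",
        "llavavideo": "scripts/inference_llavavideo.sh",
        "longvu": "scripts/inference_longvu.sh",
    },
}


def script_for(mode: str, model: str) -> str:
    """Return the script path for a mode ('train' or 'inference') and model."""
    try:
        return SCRIPTS[mode][model]
    except KeyError:
        raise ValueError(f"unknown mode/model: {mode!r}/{model!r}") from None


def launch(mode: str, model: str) -> subprocess.CompletedProcess:
    """Run the selected script with bash; raises if the script fails."""
    return subprocess.run(["bash", script_for(mode, model)], check=True)
```

For example, `launch("inference", "longvu")` is equivalent to running `bash scripts/inference_longvu.sh`.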

Results

RRPO improves over the base models on nearly all benchmarks and matches or outperforms DPO in every setting except MVBench with VideoChat2.

| Models | #F | TVBench | TempCompass | VideoHallucer | VidHalluc | MVBench | VideoMME | MLVU | LongVideoBench |
|---|---|---|---|---|---|---|---|---|---|
| VideoChat2 | 16 | 44.0 | 59.3 | 23.1 | 73.3 | 60.2 | 41.0 | 46.4 | 40.4 |
| VideoChat2 + DPO | 16 | 45.7 | 60.0 | 22.1 | 72.4 | 59.6 | 43.0 | 47.4 | 41.0 |
| VideoChat2 + RRPO | 16 | 45.8 | 60.2 | 32.9 | 76.4 | 59.0 | 44.3 | 47.9 | 42.8 |
| LLaVA-Video | 64 | 51.0 | 66.0 | 50.0 | 76.6 | 61.1 | 64.0 | 68.6 | 60.1 |
| LLaVA-Video + DPO | 64 | 51.9 | 66.4 | 53.3 | 76.5 | 60.6 | 63.1 | 67.4 | 59.4 |
| LLaVA-Video + RRPO | 64 | 51.9 | 66.8 | 55.7 | 76.5 | 62.2 | 64.5 | 69.1 | 60.4 |
| LLaVA-Video + RRPO (32f) | 64 | 52.2 | 67.4 | 55.8 | 76.6 | 62.1 | 64.5 | 69.4 | 60.1 |
| LongVU | 1fps | 53.7 | 63.9 | 39.2 | 67.3 | 65.5 | 56.2 | 63.6 | 48.6 |
| LongVU + DPO | 1fps | 54.3 | 64.3 | 40.9 | 68.5 | 65.9 | 56.6 | 63.6 | 49.4 |
| LongVU + RRPO | 1fps | 56.5 | 64.5 | 44.0 | 71.7 | 66.8 | 57.7 | 64.5 | 49.7 |
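
The per-benchmark comparison can be recomputed directly from the table. The sketch below copies the scores above (benchmark names written without spaces, and omitting the 32f variant) and reports the average gain of one method over another, along with any benchmarks where it scores lower. The helper names are illustrative only.

```python
BENCHMARKS = ["TVBench", "TempCompass", "VideoHallucer", "VidHalluc",
              "MVBench", "VideoMME", "MLVU", "LongVideoBench"]

# Scores copied from the results table, in BENCHMARKS order.
SCORES = {
    "VideoChat2": {
        "base": [44.0, 59.3, 23.1, 73.3, 60.2, 41.0, 46.4, 40.4],
        "DPO":  [45.7, 60.0, 22.1, 72.4, 59.6, 43.0, 47.4, 41.0],
        "RRPO": [45.8, 60.2, 32.9, 76.4, 59.0, 44.3, 47.9, 42.8],
    },
    "LLaVA-Video": {
        "base": [51.0, 66.0, 50.0, 76.6, 61.1, 64.0, 68.6, 60.1],
        "DPO":  [51.9, 66.4, 53.3, 76.5, 60.6, 63.1, 67.4, 59.4],
        "RRPO": [51.9, 66.8, 55.7, 76.5, 62.2, 64.5, 69.1, 60.4],
    },
    "LongVU": {
        "base": [53.7, 63.9, 39.2, 67.3, 65.5, 56.2, 63.6, 48.6],
        "DPO":  [54.3, 64.3, 40.9, 68.5, 65.9, 56.6, 63.6, 49.4],
        "RRPO": [56.5, 64.5, 44.0, 71.7, 66.8, 57.7, 64.5, 49.7],
    },
}


def mean_gain(model, method, reference):
    """Average score difference (method minus reference) over all benchmarks."""
    rows = SCORES[model]
    diffs = [m - r for m, r in zip(rows[method], rows[reference])]
    return round(sum(diffs) / len(diffs), 2)


def regressions(model, method, reference):
    """Benchmarks where `method` scores strictly below `reference`."""
    rows = SCORES[model]
    return [name for name, m, r in zip(BENCHMARKS, rows[method], rows[reference])
            if m < r]


if __name__ == "__main__":
    for model in SCORES:
        for reference in ("base", "DPO"):
            print(model, "RRPO vs", reference,
                  "mean gain:", mean_gain(model, "RRPO", reference),
                  "lower on:", regressions(model, "RRPO", reference))
```

Averaging across benchmarks with different scales is only a rough summary, so the per-benchmark list is the more reliable view.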

Evaluation

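You can download the evaluation benchmarks reported in the results (TVBench, TempCompass, VideoHallucer, VidHalluc, MVBench, VideoMME, MLVU, and LongVideoBench) from their original sources.

Next, run the full evaluation by following the instructions provided here.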

Citation

If you find this work useful, please consider citing our paper:

@article{sarkar2025rrpo,
  title={Self-Alignment of Large Video Language Models with Refined Regularized Preference Optimization},
  author={Sarkar, Pritam and Etemad, Ali},
  journal={arXiv preprint arXiv:2504.12083},
  year={2025}
}

Usage and License Notices

This project incorporates datasets and model checkpoints that are subject to their respective original licenses. Users must adhere to the terms and conditions specified by these licenses. The assets used in this work include, but are not limited to: VideoChat2-IT, VideoChat2_stage3_Mistral_7B, LLaVA-Video-7B-Qwen2, LongVU_Qwen2_7B. This project does not impose any additional constraints beyond those stipulated in the original licenses. Users must ensure their usage complies with all applicable laws and regulations. This repository is released under the Apache 2.0 License. See LICENSE for details.


For any issues or questions, please open an issue or contact Pritam Sarkar at [email protected]!