OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Abstract
Text-to-video (T2V) generation has recently garnered significant attention thanks to the large multi-modality model Sora. However, T2V generation still faces two important challenges: 1) The lack of a precise, open-source, high-quality dataset. Popular video datasets such as WebVid-10M and Panda-70M are either of low quality or too large for most research institutions, so collecting precise, high-quality text-video pairs for T2V generation is challenging but crucial. 2) The failure to fully utilize textual information. Recent T2V methods focus on vision transformers and use a simple cross-attention module for video generation, which falls short of thoroughly extracting semantic information from the text prompt. To address these issues, we introduce OpenVid-1M, a precise, high-quality dataset with expressive captions. This open-scenario dataset contains over 1 million text-video pairs, facilitating research on T2V generation. Furthermore, we curate 433K 1080p videos from OpenVid-1M to create OpenVidHD-0.4M, advancing high-definition video generation. Additionally, we propose a novel Multi-modal Video Diffusion Transformer (MVDiT) capable of mining both structural information from visual tokens and semantic information from text tokens. Extensive experiments and ablation studies verify the superiority of OpenVid-1M over previous datasets and the effectiveness of our MVDiT.
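To make the contrast with plain cross-attention concrete, the sketch below shows one way visual and text tokens can be jointly attended so that both streams shape the attention pattern. This is purely illustrative and is not the paper's actual MVDiT block: the concatenation-based joint attention, the module names, and the shared token dimension `dim` are all assumptions made for the example.

```python
# Illustrative sketch only -- NOT the exact MVDiT design from the paper.
# Assumption: visual and text tokens share a model dimension `dim` and are
# concatenated so a single self-attention pass mixes both modalities.
import torch
import torch.nn as nn

class JointMultiModalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, N_v, dim), text_tokens: (B, N_t, dim)
        tokens = self.norm(torch.cat([visual_tokens, text_tokens], dim=1))
        out, _ = self.attn(tokens, tokens, tokens)  # joint self-attention over both modalities
        n_v = visual_tokens.shape[1]
        return out[:, :n_v], out[:, n_v:]  # split back into visual / text streams

class CrossAttentionBaseline(nn.Module):
    """Plain cross-attention for comparison: visual queries read from text
    keys/values only, so the text tokens themselves are never updated."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        out, _ = self.attn(visual_tokens, text_tokens, text_tokens)
        return out
```

In the joint variant, text tokens are updated alongside visual tokens at every block, which is one way to extract more semantic information from the prompt than a single cross-attention lookup.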
Community
Paper, code, dataset and models are all released.
paper: http://export.arxiv.org/pdf/2407.02371
project website: https://nju-pcalab.github.io/projects/openvid
code: https://github.com/NJU-PCALab/OpenVid-1M
dataset: https://huggingface.co/datasets/nkp37/OpenVid-1M
models: https://huggingface.co/datasets/nkp37/OpenVid-1M/tree/main/model_weights
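For readers who want to try the data, here is a minimal download sketch using the `huggingface_hub` client. It assumes the dataset repo above exposes the caption metadata as a plain CSV file; the path `data/train/OpenVid-1M.csv` is an assumption for illustration and should be checked against the repo's file listing.

```python
# Minimal sketch: fetch the OpenVid-1M caption metadata from the Hub.
# Assumption: captions live in a CSV at data/train/OpenVid-1M.csv -- verify
# the actual path in the dataset's "Files and versions" tab.
import pandas as pd
from huggingface_hub import hf_hub_download

csv_path = hf_hub_download(
    repo_id="nkp37/OpenVid-1M",
    filename="data/train/OpenVid-1M.csv",  # assumed path
    repo_type="dataset",
)
captions = pd.read_csv(csv_path)
print(len(captions), "text-video pairs")
print(captions.head())
```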
Thanks!
Hi @yingtai, congrats on this work!
Great to see you're making the dataset and models available on HF.
Would you be able to link the dataset to this paper? Here's how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.
I also saw the models are currently part of the dataset repo; would you be able to create model repositories for them instead (so that they appear as models citing this paper)? Here's how to do that: https://huggingface.co/docs/hub/models-uploading. They can be linked to the paper as explained here.
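As a concrete illustration of the two suggestions above, the sketch below uses the `huggingface_hub` Python client: it links the dataset to the paper by adding the arXiv URL to the dataset card (the Hub detects arXiv links in the card), then creates a standalone model repository and uploads a local weights folder. The model repo name `nkp37/OpenVid-MVDiT` and the local path are placeholders, not actual repositories.

```python
# Sketch of the two steps suggested above, using huggingface_hub.
# Requires `huggingface-cli login` with a write token; repo names and the
# local weights path are placeholders.
from huggingface_hub import DatasetCard, HfApi

# 1) Link the dataset to the paper: the Hub picks up arXiv links found in
#    the dataset card, so adding the paper URL to the README is enough.
card = DatasetCard.load("nkp37/OpenVid-1M")
if "arxiv.org/abs/2407.02371" not in card.text:
    card.text += "\n\nPaper: https://arxiv.org/abs/2407.02371\n"
    card.push_to_hub("nkp37/OpenVid-1M")

# 2) Move the weights into a standalone model repo so they appear under
#    "Models citing this paper".
api = HfApi()
api.create_repo("nkp37/OpenVid-MVDiT", repo_type="model", exist_ok=True)  # placeholder name
api.upload_folder(
    folder_path="./model_weights",   # local checkpoint directory (placeholder)
    repo_id="nkp37/OpenVid-MVDiT",
    repo_type="model",
)
```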
Thanks for your suggestions!
We will link the dataset and create the model repos in the coming days!