# Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

![Visitor Count](https://komarev.com/ghpvc/?username=zibojia&repo=SENORITA&label=visitors)
[![Model](https://img.shields.io/badge/HuggingFace-Model-blue)](https://huggingface.co/PengWeixuanSZU/Senorita-2M)
[![Demo Page](https://img.shields.io/badge/Website-Demo%20Page-green)](https://senorita-2m-dataset.github.io/)
[![Dataset](https://img.shields.io/badge/HuggingFace-Dataset-orange)](https://huggingface.co/datasets/SENORITADATASET/Senorita)

## Overview

Señorita-2M is a comprehensive, high-quality dataset for general video editing tasks. It pairs a vast collection of video clips with detailed editing instructions, with the edited results produced by trained video specialists.

## Abstract

Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still face challenges in quality and efficiency.

## Key Features

- **High-Quality Annotations**: Each video in the dataset is accompanied by a precise, detailed editing instruction, with the edits produced by our trained video specialists.
- **Diverse Editing Tasks**: The dataset covers a wide range of video editing tasks, including object removal, object swapping, and global and local stylization.
- **Large Scale**: With over **2 million** video clips, Señorita-2M is one of the largest video editing datasets available (a sketch of streaming it follows this list).

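If the repository can be consumed with the standard `datasets` library, streaming a few samples might look like the minimal sketch below. This is an assumption-laden illustration, not a verified loader: the split name and the `instruction` field are guesses about the schema, so check the dataset page for the actual layout.

```python
# Hypothetical loading sketch: assumes the repo exposes a "train" split and
# that examples carry an "instruction" field. Streaming avoids downloading
# all ~2M clips up front.
from datasets import load_dataset

ds = load_dataset("SENORITADATASET/Senorita", split="train", streaming=True)

for example in ds.take(3):
    # Fall back to printing the whole record if the assumed field is absent.
    print(example.get("instruction", example))
```
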
## Dataset Construction

We built the dataset by leveraging high-quality video editing experts. Specifically, we trained four such experts on CogVideoX: a global stylizer, a local stylizer, an inpainting expert, and a super-resolution expert.

Furthermore, we used this dataset to train multiple video editors based on different video editing architectures, evaluating the effectiveness of various editing frameworks and ultimately achieving impressive results.

## Editing Tasks

Our dataset covers **18** editing tasks. Five of them are produced by our trained experts, while the remaining tasks are handled with off-the-shelf computer vision techniques. The expert-edited portion accounts for around 70% of the total dataset.

## Paper Content

### Dataset Construction Pipeline
![Dataset Construction Pipeline](images/teaser.PNG)
The dataset construction pipeline involves several stages: data collection, annotation, and quality verification. We crawled around 390,000 videos from Pexels, a video-sharing website offering high-resolution, high-quality footage, via its authenticated API. Each video clip is meticulously annotated by video specialists to ensure the highest quality. Captioning is handled by BLIP-2, keeping captions within CLIP's text-length limit, while mask regions and their corresponding phrases are obtained with CogVLM2 and Grounded-SAM2.

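To make the captioning stage concrete, the sketch below captions a single video keyframe with BLIP-2, which keeps captions short enough for CLIP's 77-token text limit. The checkpoint id and the one-keyframe-per-clip simplification are assumptions, and the CogVLM2 + Grounded-SAM2 masking stage is omitted.

```python
# Minimal BLIP-2 captioning sketch for one keyframe of a clip.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_keyframe(frame: Image.Image) -> str:
    """Caption a single RGB keyframe; short captions fit CLIP's limit."""
    inputs = processor(images=frame, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```
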
### Global Stylization
![Global Stylization](images/global_stylization.PNG)
Global stylization applies a consistent style across the entire video, ensuring a uniform look and feel throughout. This task is performed by the global stylizer trained on CogVideoX. Its video ControlNet combines multiple control conditions (Canny edges, HED edges, and depth maps), each encoded into latent space by a 3D-VAE, to produce robust style transfer results.

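As a concrete example of one control condition, the sketch below computes per-frame Canny edge maps; HED and depth maps would be produced analogously by their own detectors, and the 3D-VAE encoding into latent space is not shown. Thresholds and array shapes are illustrative assumptions.

```python
# Per-frame Canny control maps for a video, one of the three conditions.
import cv2
import numpy as np

def canny_condition(frames: np.ndarray, low: int = 100, high: int = 200) -> np.ndarray:
    """frames: (T, H, W, 3) uint8 RGB -> (T, H, W, 3) uint8 edge maps."""
    edges = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        e = cv2.Canny(gray, low, high)
        edges.append(np.repeat(e[..., None], 3, axis=-1))  # 3 channels for the VAE
    return np.stack(edges)
```
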
### Local Stylization
![Local Stylization](images/local_stylization.PNG)
Local stylization targets specific regions within the video, allowing for more detailed, localized effects. Inspired by inpainting methods such as AVID, we trained a local stylizer that combines inpainting with ControlNet. The model takes the same three control conditions as the global stylizer in its ControlNet branch, while the mask conditions are fed into the main branch. The pretrained base model is CogVideoX-2B.

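The two input streams can be sketched as below, assuming simple (B, T, C, H, W) tensors; the channel concatenation layout is an illustrative assumption, not the model's actual input packing.

```python
# Hypothetical assembly of the local stylizer's two branches.
import torch

def assemble_inputs(video: torch.Tensor, mask: torch.Tensor, controls: torch.Tensor):
    """
    video:    (B, T, 3, H, W) source clip
    mask:     (B, T, 1, H, W), 1 marks the region to restylize
    controls: (B, T, 3, H, W) Canny/HED/depth maps
    """
    masked_video = video * (1.0 - mask)                   # hide the edit region
    main_branch = torch.cat([masked_video, mask], dim=2)  # mask into the main branch
    controlnet_branch = controls                          # controls into the ControlNet branch
    return main_branch, controlnet_branch
```
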
### Object Removal
![Object Removal](images/object_removal.PNG)
Object removal is a common video editing task in which unwanted objects are seamlessly removed from the video. Our inpainting expert handles this task efficiently, ensuring that the background is accurately reconstructed. Current video inpainters such as ProPainter generate blur when removing objects, which severely limits their usability; we therefore trained a powerful video remover based on CogVideoX-2B, using a novel mask selection strategy.

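The mask selection strategy itself is not detailed in this README, so the sketch below shows only a generic preprocessing step commonly used when training removers: dilating per-frame object masks so the model fully covers object boundaries. It is an illustration, not the paper's strategy.

```python
# Generic mask dilation before removal training (illustrative only).
import cv2
import numpy as np

def dilate_masks(masks: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """masks: (T, H, W) uint8 {0, 255} object masks -> dilated masks."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return np.stack([cv2.dilate(m, kernel, iterations=1) for m in masks])
```
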
### Object Swap
![Object Swap](images/object_swap.PNG)
Object swap replaces one object with another within the video, ensuring that the new object blends seamlessly with the surrounding environment. The pipeline combines FLUX-Fill with our trained inpainter: LLaMA-3 first suggests a replacement object, FLUX-Fill then swaps it into the first frame, and the inpainter generates the remaining frames guided by that edited first frame.

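Structurally, the swap pipeline is three steps, mirrored in the sketch below with hypothetical helpers (`suggest_replacement`, `fill_first_frame`, `propagate_edit`) standing in for LLaMA-3, FLUX-Fill, and the trained inpainter; none of these names is a released API.

```python
# High-level shape of the object-swap pipeline (hypothetical helpers).
def swap_object(frames, masks, source_phrase):
    # 1. An LLM (LLaMA-3 in the paper) proposes a plausible replacement.
    target_phrase = suggest_replacement(source_phrase)

    # 2. An image inpainter (FLUX-Fill) swaps the object in the first frame only.
    first_frame = fill_first_frame(frames[0], masks[0], target_phrase)

    # 3. The trained video inpainter fills the masked region of the remaining
    #    frames, guided by the edited first frame.
    edited_frames = propagate_edit(frames, masks, first_frame)
    return edited_frames, target_phrase
```
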
## Citation

If you use Señorita-2M in your research, please cite our work as follows:

```bibtex
@article{zi2025senorita,
  title={Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists},
  author={Bojia Zi and Penghui Ruan and Marco Chen and Xianbiao Qi and Shaozhe Hao and Shihao Zhao and Youze Huang and Bin Liang and Rong Xiao and Kam-Fai Wong},
  journal={arXiv preprint arXiv:2502.06734},
  year={2025},
}
```

## Authors

- Bojia Zi, The Chinese University of Hong Kong
- Penghui Ruan, The Hong Kong Polytechnic University
- Marco Chen, Tsinghua University
- Xianbiao Qi, IntelliFusion Inc.
- Shaozhe Hao, The University of Hong Kong
- Shihao Zhao, The University of Hong Kong
- Youze Huang, University of Electronic Science and Technology of China
- Bin Liang, The Chinese University of Hong Kong
- Rong Xiao, IntelliFusion Inc.
- Kam-Fai Wong, The Chinese University of Hong Kong

**Note**: * indicates equal contribution. † indicates the corresponding author.

## Contact

For more information or any queries regarding the dataset, please contact us at [email protected].