Text-to-Video
Diffusers
Safetensors
t2v
JumpingXL committed · Commit 6ef9292 (verified) · 1 Parent(s): 798aaed

Update README.md

Files changed (1)
  1. README.md +62 -26
README.md CHANGED
@@ -10,37 +10,37 @@ license_link: LICENSE
10
  <h1 align="center">SkyReels V2: Infinite-Length Film Generative Model</h1>
11
 
12
  <p align="center">
13
- 📑 <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a> · 👋 <a href="https://www.skyreels.ai/home?utm_campaign=huggingface_skyreels_v2" target="_blank">Playground</a> · 💬 <a href="https://discord.gg/PwM6NYtccQ" target="_blank">Discord</a> · 🤗 <a href="https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9" target="_blank">Hugging Face</a> · 🤖 <a href="https://www.modelscope.cn/collections/SkyReels-V2-f665650130b144" target="_blank">ModelScope</a> · 🌐 <a href="https://github.com/SkyworkAI/SkyReels-V2" target="_blank">GitHub</a>
14
  </p>
15
 
16
  ---
17
- Welcome to the SkyReels V2 repository! Here, you'll find the model weights for our infinite-lenght film genetative models
18
 
19
 
20
  ## 🔥🔥🔥 News!!
21
-
22
  * Apr 21, 2025: 👋 We release the inference code and model weights of [SkyReels-V2](https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9) Series Models and the video captioning model [SkyCaptioner-V1](https://huggingface.co/Skywork/SkyCaptioner-V1) .
23
  * Apr 3, 2025: 🔥 We also release [SkyReels-A2](https://github.com/SkyworkAI/SkyReels-A2). This is an open-sourced controllable video generation framework capable of assembling arbitrary visual elements.
 
 
24
 
25
  ## 🎥 Demos
26
  <table>
27
  <tr>
28
  <td align="center">
29
- <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/63edffa3190ddd6214ef0116/fZkHmPGm_PM9lWM3aUoJy.mp4"></video>
30
  </td>
31
  <td align="center">
32
- <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/63edffa3190ddd6214ef0116/xOKBbYAaslS--eEqv3HYY.mp4"></video>
33
  </td>
34
  <td align="center">
35
- <video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/63edffa3190ddd6214ef0116/3gU7UAwwAOt3wS2KwYlWL.mp4"></video>
36
  </td>
37
  </tr>
38
  </table>
39
  The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model.
40
 
41
 
42
-
43
-
44
  ## 📑 TODO List
45
 
46
  - [x] <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a>
@@ -81,7 +81,7 @@ You can download our models from Hugging Face:
81
  <td rowspan="5">Diffusion Forcing</td>
82
  <td>1.3B-540P</td>
83
  <td>544 * 960 * 97f</td>
84
- <td>馃 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B">Huggingface</a> 馃 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-1.3B">ModelScope</a></td>
85
  </tr>
86
  <tr>
87
  <td>5B-540P</td>
@@ -101,7 +101,7 @@ You can download our models from Hugging Face:
101
  <tr>
102
  <td>14B-720P</td>
103
  <td>720 * 1280 * 121f</td>
104
- <td>Coming Soon</td>
105
  </tr>
106
  <tr>
107
  <td rowspan="5">Text-to-Video</td>
@@ -133,7 +133,7 @@ You can download our models from Hugging Face:
133
  <td rowspan="5">Image-to-Video</td>
134
  <td>1.3B-540P</td>
135
  <td>544 * 960 * 97f</td>
136
- <td>馃 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B">Huggingface</a> 馃 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-1.3B">ModelScope</a></td>
137
  </tr>
138
  <tr>
139
  <td>5B-540P</td>
@@ -153,7 +153,7 @@ You can download our models from Hugging Face:
153
  <tr>
154
  <td>14B-720P</td>
155
  <td>720 * 1280 * 121f</td>
156
- <td>Coming Soon</td>
157
  </tr>
158
  <tr>
159
  <td rowspan="3">Camera Director</td>
@@ -179,10 +179,11 @@ After downloading, set the model path in your generation commands:
179
 
180
  #### Single GPU Inference
181
 
182
- - **Diffusion Forcing**
183
 
184
- The <a href="https://arxiv.org/abs/2407.01392">**Diffusion Forcing**</a> version model allows us to generate Infinite-Length videos. This model supports both **text-to-video (T2V)** and **image-to-video (I2V)** tasks, and it can perform inference in both synchronous and asynchronous modes.
185
 
 
186
  ```shell
187
  model_id=Skywork/SkyReels-V2-DF-14B-540P
188
  # synchronous inference
@@ -195,12 +196,36 @@ python3 generate_video_df.py \
195
  --overlap_history 17 \
196
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
197
  --addnoise_condition 20 \
198
  --offload
199
  ```
 
200
  > **Note**:
201
- > - If you want to run the **image-to-video (I2V)** task, add `--image ${image_path}` to your command and it is also better to use **text-to-video (T2V)** prompt including the description of the first-frame image.
202
- > - You can use `--ar_step 5` to enable asynchronous inference. When asynchronous inference, `--causal_block_size 5` is recommanded.
203
- > - To reduce peak VRAM, lower the `--base_num_frames` for the same generative length `--num_frames`. This may slightly reduce video quality.
 
 
 
204
 
205
  - **Text To Video & Image To Video**
206
 
@@ -215,20 +240,27 @@ python3 generate_video.py \
215
  --shift 8.0 \
216
  --fps 24 \
217
  --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
218
- --offload
 
 
 
219
  ```
220
  > **Note**:
221
- > - When using an **image-to-video (I2V)** model, you must provide an input image using the `--image ${image_path}` parameter. The `--guidance_scale 5.0` and `--shift 3.0` is recommanded for I2V model.
 
222
 
223
 
224
  - **Prompt Enhancer**
225
 
226
- The prompt enhancer is implemented based on <a href="https://huggingface.co/Qwen/Qwen2.5-32B-Instruct">Qwen2.5-32B-Instruct</a> and is utilized via the `--prompt_enhancer` parameter. It works ideally for short prompts, while for long prompts, it might generate an excessively lengthy prompt that could lead to over-saturation in the generative video. Note the peak memory of GPU is 64G+ if use `--prompt_enhancer`. If you want obtain the enhanced prompt separately, you can also run the prompt_dehancer script separately for testing. The steps are as follows:
227
 
228
  ```shell
229
  cd skyreels_v2_infer/pipelines
230
  python3 prompt_enhancer.py --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface."
231
  ```
 
 
 
232
 
233
  **Advanced Configuration Options**
234
 
@@ -248,7 +280,10 @@ Below are the key parameters you can customize for video generation:
248
  | --offload | True | Offloads model components to CPU to reduce VRAM usage (recommended) |
249
  | --use_usp | True | Enables multi-GPU acceleration with xDiT USP |
250
  | --outdir | ./video_out | Directory where generated videos will be saved |
251
- | --prompt_enhancer | True | expand the prompt into a more detailed description |
 
 
 
252
 
253
  **Diffusion Forcing Additional Parameters**
254
  | Parameter | Recommended Value | Description |
@@ -273,7 +308,8 @@ torchrun --nproc_per_node=2 generate_video_df.py \
273
  --base_num_frames 97 \
274
  --num_frames 257 \
275
  --overlap_history 17 \
276
- --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
 
277
  --use_usp \
278
  --offload \
279
  --seed 42
@@ -295,7 +331,7 @@ torchrun --nproc_per_node=2 generate_video.py \
295
  --seed 42
296
  ```
297
  > **Note**:
298
- > - When using an **image-to-video (I2V)** model, you must provide an input image using the `--image ${image_path}` parameter. The `--guidance_scale 5.0` and `--shift 3.0` is recommanded for I2V model.
299
 
300
 
301
  ## Contents
@@ -305,7 +341,7 @@ torchrun --nproc_per_node=2 generate_video.py \
305
  - [Video Captioner](#video-captioner)
306
  - [Reinforcement Learning](#reinforcement-learning)
307
  - [Diffusion Forcing](#diffusion-forcing)
308
- - [Hight-Quality Supervised Fine-Tuning(SFT)](#high-quality-supervised-fine-tuning-sft)
309
  - [Performance](#performance)
310
  - [Acknowledgements](#acknowledgements)
311
  - [Citation](#citation)
@@ -453,7 +489,7 @@ Inspired by the previous success in LLM, we propose to enhance the performance o
453
  - the generative model does not handle well with large, deformable motions.
454
  - the generated videos may violate the physical law.
455
 
456
- To avoid the degradation in other metrics, such as text alignment and video quality, we ensure the preference data pairs have comparable text alignment and video quality, while only the motion quality varies. This requirement poses greater challenges in obtaining preference annotations due to the inherently higher costs of human annotation. To address this challenge, we propose a semi-automatic pipeline that strategically combines automatically generated motion pairsand human annotation results. This hybrid approach not only enhances the data scale but also improves alignment with human preferences through curated quality control. Leveraging this enhanced dataset, we first train a specialized reward model to capture the generic motion quality differences between paired samples. This learned reward function subsequently guides the sample selection process for Direct Preference Optimization (DPO), enhancing the motionquality of the generative model.
457
 
458
  #### Diffusion Forcing
459
 
@@ -676,7 +712,7 @@ We would like to thank the contributors of <a href="https://github.com/Wan-Video
676
  ```bibtex
677
  @misc{chen2025skyreelsv2infinitelengthfilmgenerative,
678
  title={SkyReels-V2: Infinite-length Film Generative Model},
679
- author={Guibin Chen and Dixuan Lin and Jiangping Yang and Chunze Lin and Juncheng Zhu and Mingyuan Fan and Hao Zhang and Sheng Chen and Zheng Chen and Chengchen Ma and Weiming Xiong and Wei Wang and Nuo Pang and Kang Kang and Zhiheng Xu and Yuzhe Jin and Yupeng Liang and Yubing Song and Peng Zhao and Boyuan Xu and Di Qiu and Debang Li and Zhengcong Fei and Yang Li and Yahui Zhou},
680
  year={2025},
681
  eprint={2504.13074},
682
  archivePrefix={arXiv},
 
10
  <h1 align="center">SkyReels V2: Infinite-Length Film Generative Model</h1>
11
 
12
  <p align="center">
13
+ 📑 <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a> · 👋 <a href="https://www.skyreels.ai/home?utm_campaign=github_SkyReels_V2" target="_blank">Playground</a> · 💬 <a href="https://discord.gg/PwM6NYtccQ" target="_blank">Discord</a> · 🤗 <a href="https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9" target="_blank">Hugging Face</a> · 🤖 <a href="https://www.modelscope.cn/collections/SkyReels-V2-f665650130b144" target="_blank">ModelScope</a>
14
  </p>
15
 
16
  ---
17
+ Welcome to the SkyReels V2 repository! Here, you'll find the model weights and inference code for our infinite-length film generative models.
18
 
19
 
20
  ## 🔥🔥🔥 News!!
21
+ * Apr 24, 2025: 🔥 We release the 720P models, [SkyReels-V2-DF-14B-720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P) and [SkyReels-V2-I2V-14B-720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P). The former facilitates infinite-length autoregressive video generation, and the latter focuses on Image2Video synthesis.
22
  * Apr 21, 2025: 👋 We release the inference code and model weights of [SkyReels-V2](https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9) Series Models and the video captioning model [SkyCaptioner-V1](https://huggingface.co/Skywork/SkyCaptioner-V1) .
23
  * Apr 3, 2025: 🔥 We also release [SkyReels-A2](https://github.com/SkyworkAI/SkyReels-A2). This is an open-sourced controllable video generation framework capable of assembling arbitrary visual elements.
24
+ * Feb 18, 2025: 🔥 We released [SkyReels-A1](https://github.com/SkyworkAI/SkyReels-A1). This is an open-sourced and effective framework for portrait image animation.
25
+ * Feb 18, 2025: 🔥 We released [SkyReels-V1](https://github.com/SkyworkAI/SkyReels-V1). This is the first and most advanced open-source human-centric video foundation model.
26
 
27
  ## 🎥 Demos
28
  <table>
29
  <tr>
30
  <td align="center">
31
+ <video src="https://github.com/user-attachments/assets/f6f9f9a7-5d5f-433c-9d73-d8d593b7ad25" width="100%"></video>
32
  </td>
33
  <td align="center">
34
+ <video src="https://github.com/user-attachments/assets/0eb13415-f4d9-4aaf-bcd3-3031851109b9" width="100%"></video>
35
  </td>
36
  <td align="center">
37
+ <video src="https://github.com/user-attachments/assets/dcd16603-5bf4-4786-8e4d-1ed23889d07a" width="100%"></video>
38
  </td>
39
  </tr>
40
  </table>
41
  The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model.
42
 
43
 
 
 
44
  ## 📑 TODO List
45
 
46
  - [x] <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a>
 
81
  <td rowspan="5">Diffusion Forcing</td>
82
  <td>1.3B-540P</td>
83
  <td>544 * 960 * 97f</td>
84
+ <td>馃 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P">Huggingface</a> 馃 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-1.3B-540P">ModelScope</a></td>
85
  </tr>
86
  <tr>
87
  <td>5B-540P</td>
 
101
  <tr>
102
  <td>14B-720P</td>
103
  <td>720 * 1280 * 121f</td>
104
+ <td>馃 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P">Huggingface</a> 馃 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-14B-720P">ModelScope</a></td>
105
  </tr>
106
  <tr>
107
  <td rowspan="5">Text-to-Video</td>
 
133
  <td rowspan="5">Image-to-Video</td>
134
  <td>1.3B-540P</td>
135
  <td>544 * 960 * 97f</td>
136
+ <td>馃 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P">Huggingface</a> 馃 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-1.3B-540P">ModelScope</a></td>
137
  </tr>
138
  <tr>
139
  <td>5B-540P</td>
 
153
  <tr>
154
  <td>14B-720P</td>
155
  <td>720 * 1280 * 121f</td>
156
+ <td>馃 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P">Huggingface</a> 馃 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-14B-720P">ModelScope</a></td>
157
  </tr>
158
  <tr>
159
  <td rowspan="3">Camera Director</td>
 
179
 
180
  #### Single GPU Inference
181
 
182
+ - **Diffusion Forcing for Long Video Generation**
183
 
184
+ The <a href="https://arxiv.org/abs/2407.01392">**Diffusion Forcing**</a> version of the model allows us to generate infinite-length videos. This model supports both **text-to-video (T2V)** and **image-to-video (I2V)** tasks, and it can perform inference in both synchronous and asynchronous modes. Here we demonstrate two example scripts for long video generation. If you want to adjust the inference parameters, e.g., the video duration or the inference mode, read the Note below first.
185
 
186
+ Synchronous generation for a 10s video:
187
  ```shell
188
  model_id=Skywork/SkyReels-V2-DF-14B-540P
189
  # synchronous inference
 
196
  --overlap_history 17 \
197
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
198
  --addnoise_condition 20 \
199
+ --offload \
200
+ --teacache \
201
+ --use_ret_steps \
202
+ --teacache_thresh 0.3
203
+ ```
204
+
205
+ Asynchronous generation for a 30s video:
206
+ ```shell
207
+ model_id=Skywork/SkyReels-V2-DF-14B-540P
208
+ # asynchronous inference
209
+ python3 generate_video_df.py \
210
+ --model_id ${model_id} \
211
+ --resolution 540P \
212
+ --ar_step 5 \
213
+ --causal_block_size 5 \
214
+ --base_num_frames 97 \
215
+ --num_frames 737 \
216
+ --overlap_history 17 \
217
+ --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
218
+ --addnoise_condition 20 \
219
  --offload
220
  ```
221
+
222
  > **Note**:
223
+ > - If you want to run the **image-to-video (I2V)** task, add `--image ${image_path}` to your command; it is also better to use a **text-to-video (T2V)**-like prompt that includes a description of the first-frame image.
224
+ > - For long video generation, simply adjust `--num_frames`, e.g., `--num_frames 257` for a 10s video, `--num_frames 377` for 15s, `--num_frames 737` for 30s, or `--num_frames 1457` for 60s. These numbers are not strictly aligned with the logical frame count for the specified duration, but they are aligned with some training parameters, which means they may perform better. When you use asynchronous inference with causal_block_size > 1, `--num_frames` should be set carefully.
225
+ > - You can use `--ar_step 5` to enable asynchronous inference; in that case, `--causal_block_size 5` is recommended, and it should not be set for synchronous generation. REMEMBER that the number of frame latents fed into the model in every iteration, e.g., the base frame latent count ((97-1)//4+1=25 for base_num_frames=97) and the count for the last iteration ((237-97-(97-17)x1+17-1)//4+1=20 for base_num_frames=97, num_frames=237, overlap_history=17), MUST be divisible by causal_block_size. If these values are hard to calculate, just use our recommended settings above (a small sanity-check script is sketched after these notes). Asynchronous inference takes more steps to diffuse the whole sequence, which means it is SLOWER than synchronous mode. In our experiments, asynchronous inference may improve instruction following and visual consistency.
226
+ > - To reduce peak VRAM, lower `--base_num_frames`, e.g., to 77 or 57, while keeping the same total generative length `--num_frames`. This may slightly reduce video quality, and it should not be set too small.
227
+ > - `--addnoise_condition` helps smooth long video generation by adding some noise to the clean condition. Too much noise can cause inconsistency as well. 20 is a recommended value; you may try larger ones, but it is best not to exceed 50.
228
+ > - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM.
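The divisibility rule in the notes above is easier to check numerically. The snippet below is not part of the repository; it only restates the arithmetic quoted in these notes and assumes the 4x temporal compression implied by `(97-1)//4+1`, with illustrative variable names.

```shell
# Sanity-check asynchronous Diffusion Forcing settings (illustrative only).
base_num_frames=97
overlap_history=17
causal_block_size=5
num_frames=237   # the example quoted in the notes above

# Latents in the base window: (97-1)/4+1 = 25
base_latents=$(( (base_num_frames - 1) / 4 + 1 ))

# Latents in the last iteration of a 237-frame run with one continuation window:
# (237-97-(97-17)*1+17-1)/4+1 = 20
last_latents=$(( (num_frames - base_num_frames - (base_num_frames - overlap_history) * 1 + overlap_history - 1) / 4 + 1 ))

# Both counts must be divisible by causal_block_size (prints 1 if they are).
echo "base=${base_latents} last=${last_latents} divisible=$(( base_latents % causal_block_size == 0 && last_latents % causal_block_size == 0 ))"

# num_frames values quoted above for common durations:
# 10s -> 257, 15s -> 377, 30s -> 737, 60s -> 1457
```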
229
 
230
  - **Text To Video & Image To Video**
231
 
 
240
  --shift 8.0 \
241
  --fps 24 \
242
  --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
243
+ --offload \
244
+ --teacache \
245
+ --use_ret_steps \
246
+ --teacache_thresh 0.3
247
  ```
248
  > **Note**:
249
+ > - When using an **image-to-video (I2V)** model, you must provide an input image using the `--image ${image_path}` parameter. The `--guidance_scale 5.0` and `--shift 3.0` settings are recommended for the I2V model (see the example command after this note).
250
+ > - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 43.4GB peak VRAM.
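To make the note above concrete, here is a hedged example of an image-to-video invocation. It reuses flags shown elsewhere in this README and assumes the 540P I2V checkpoint from the table above, that `generate_video.py` accepts the same `--model_id`/`--resolution` flags as the Diffusion Forcing script, and an illustrative input image path (`swan_first_frame.jpg`):

```shell
model_id=Skywork/SkyReels-V2-I2V-1.3B-540P
# image-to-video inference with the guidance and shift values recommended for I2V
python3 generate_video.py \
  --model_id ${model_id} \
  --resolution 540P \
  --image swan_first_frame.jpg \
  --guidance_scale 5.0 \
  --shift 3.0 \
  --fps 24 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn." \
  --offload
```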
251
 
252
 
253
  - **Prompt Enhancer**
254
 
255
+ The prompt enhancer is implemented based on <a href="https://huggingface.co/Qwen/Qwen2.5-32B-Instruct">Qwen2.5-32B-Instruct</a> and is enabled via the `--prompt_enhancer` parameter. It works well for short prompts, while for long prompts it might produce an excessively lengthy prompt that could lead to over-saturation in the generated video. Note that peak GPU memory is 64GB+ if you use `--prompt_enhancer`. If you want to obtain the enhanced prompt separately, you can also run the prompt_enhancer script on its own for testing. The steps are as follows:
256
 
257
  ```shell
258
  cd skyreels_v2_infer/pipelines
259
  python3 prompt_enhancer.py --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface."
260
  ```
261
+ > **Note**:
262
+ > - `--prompt_enhancer` is not allowed when using `--use_usp`. We recommend running the skyreels_v2_infer/pipelines/prompt_enhancer.py script first to generate the enhanced prompt before enabling the `--use_usp` parameter (a sketch of this two-step workflow follows below).
263
+
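The note above implies a two-step workflow when combining prompt enhancement with multi-GPU inference: enhance once, then reuse the enhanced text. The sketch below is illustrative rather than authoritative: the checkpoint name is a placeholder for any T2V model from the table above, `<paste the enhanced prompt here>` stands in for whatever text prompt_enhancer.py produces, and the torchrun flags mirror the multi-GPU example later in this README.

```shell
# Step 1: generate the enhanced prompt (single GPU, no --use_usp involved).
cd skyreels_v2_infer/pipelines
python3 prompt_enhancer.py --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface."
cd ../..

# Step 2: reuse the enhanced prompt text in a multi-GPU run, without --prompt_enhancer.
torchrun --nproc_per_node=2 generate_video.py \
  --model_id Skywork/SkyReels-V2-T2V-14B-540P \
  --resolution 540P \
  --prompt "<paste the enhanced prompt here>" \
  --use_usp \
  --offload \
  --seed 42
```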
264
 
265
  **Advanced Configuration Options**
266
 
 
280
  | --offload | True | Offloads model components to CPU to reduce VRAM usage (recommended) |
281
  | --use_usp | True | Enables multi-GPU acceleration with xDiT USP |
282
  | --outdir | ./video_out | Directory where generated videos will be saved |
283
+ | --prompt_enhancer | True | Expands the prompt into a more detailed description |
284
+ | --teacache | False | Enables teacache for faster inference |
285
+ | --teacache_thresh | 0.2 | Higher values give more speedup at the cost of quality |
286
+ | --use_ret_steps | False | Retention Steps for teacache |
287
 
288
  **Diffusion Forcing Additional Parameters**
289
  | Parameter | Recommended Value | Description |
 
308
  --base_num_frames 97 \
309
  --num_frames 257 \
310
  --overlap_history 17 \
311
+ --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
312
+ --addnoise_condition 20 \
313
  --use_usp \
314
  --offload \
315
  --seed 42
 
331
  --seed 42
332
  ```
333
  > **Note**:
334
+ > - When using an **image-to-video (I2V)** model, you must provide an input image using the `--image ${image_path}` parameter. The `--guidance_scale 5.0` and `--shift 3.0` settings are recommended for the I2V model.
335
 
336
 
337
  ## Contents
 
341
  - [Video Captioner](#video-captioner)
342
  - [Reinforcement Learning](#reinforcement-learning)
343
  - [Diffusion Forcing](#diffusion-forcing)
344
+ - [High-Quality Supervised Fine-Tuning (SFT)](#high-quality-supervised-fine-tuning-sft)
345
  - [Performance](#performance)
346
  - [Acknowledgements](#acknowledgements)
347
  - [Citation](#citation)
 
489
  - the generative model does not handle well with large, deformable motions.
490
  - the generated videos may violate the physical law.
491
 
492
+ To avoid the degradation in other metrics, such as text alignment and video quality, we ensure the preference data pairs have comparable text alignment and video quality, while only the motion quality varies. This requirement poses greater challenges in obtaining preference annotations due to the inherently higher costs of human annotation. To address this challenge, we propose a semi-automatic pipeline that strategically combines automatically generated motion pairs and human annotation results. This hybrid approach not only enhances the data scale but also improves alignment with human preferences through curated quality control. Leveraging this enhanced dataset, we first train a specialized reward model to capture the generic motion quality differences between paired samples. This learned reward function subsequently guides the sample selection process for Direct Preference Optimization (DPO), enhancing the motion quality of the generative model.
493
 
494
  #### Diffusion Forcing
495
 
 
712
  ```bibtex
713
  @misc{chen2025skyreelsv2infinitelengthfilmgenerative,
714
  title={SkyReels-V2: Infinite-length Film Generative Model},
715
+ author={Guibin Chen and Dixuan Lin and Jiangping Yang and Chunze Lin and Junchen Zhu and Mingyuan Fan and Hao Zhang and Sheng Chen and Zheng Chen and Chengcheng Ma and Weiming Xiong and Wei Wang and Nuo Pang and Kang Kang and Zhiheng Xu and Yuzhe Jin and Yupeng Liang and Yubing Song and Peng Zhao and Boyuan Xu and Di Qiu and Debang Li and Zhengcong Fei and Yang Li and Yahui Zhou},
716
  year={2025},
717
  eprint={2504.13074},
718
  archivePrefix={arXiv},