Update README.md
<h1 align="center">SkyReels V2: Infinite-Length Film Generative Model</h1>

<p align="center">
📑 <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a> · 👋 <a href="https://www.skyreels.ai/home?utm_campaign=github_SkyReels_V2" target="_blank">Playground</a> · 💬 <a href="https://discord.gg/PwM6NYtccQ" target="_blank">Discord</a> · 🤗 <a href="https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9" target="_blank">Hugging Face</a> · 🤖 <a href="https://www.modelscope.cn/collections/SkyReels-V2-f665650130b144" target="_blank">ModelScope</a>
</p>

---

Welcome to the SkyReels V2 repository! Here, you'll find the model weights and inference code for our infinite-length film generative models.


## 🔥🔥🔥 News!!
* Apr 24, 2025: 🔥 We release the 720P models, [SkyReels-V2-DF-14B-720P](https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P) and [SkyReels-V2-I2V-14B-720P](https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P). The former facilitates infinite-length autoregressive video generation, and the latter focuses on image-to-video synthesis.
* Apr 21, 2025: 👋 We release the inference code and model weights of the [SkyReels-V2](https://huggingface.co/collections/Skywork/skyreels-v2-6801b1b93df627d441d0d0d9) series models and the video captioning model [SkyCaptioner-V1](https://huggingface.co/Skywork/SkyCaptioner-V1).
* Apr 3, 2025: 🔥 We also release [SkyReels-A2](https://github.com/SkyworkAI/SkyReels-A2), an open-source controllable video generation framework capable of assembling arbitrary visual elements.
* Feb 18, 2025: 🔥 We released [SkyReels-A1](https://github.com/SkyworkAI/SkyReels-A1), an open-source and effective framework for portrait image animation.
* Feb 18, 2025: 🔥 We released [SkyReels-V1](https://github.com/SkyworkAI/SkyReels-V1), the first and most advanced open-source human-centric video foundation model.

## 🎥 Demos
<table>
<tr>
<td align="center">
<video src="https://github.com/user-attachments/assets/f6f9f9a7-5d5f-433c-9d73-d8d593b7ad25" width="100%"></video>
</td>
<td align="center">
<video src="https://github.com/user-attachments/assets/0eb13415-f4d9-4aaf-bcd3-3031851109b9" width="100%"></video>
</td>
<td align="center">
<video src="https://github.com/user-attachments/assets/dcd16603-5bf4-4786-8e4d-1ed23889d07a" width="100%"></video>
</td>
</tr>
</table>
The demos above showcase 30-second videos generated using our SkyReels-V2 Diffusion Forcing model.


## 📑 TODO List

- [x] <a href="https://arxiv.org/pdf/2504.13074">Technical Report</a>

You can download our models from Hugging Face:

<td rowspan="5">Diffusion Forcing</td>
<td>1.3B-540P</td>
<td>544 * 960 * 97f</td>
<td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-1.3B-540P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-1.3B-540P">ModelScope</a></td>
</tr>
<tr>
<td>5B-540P</td>
<tr>
<td>14B-720P</td>
<td>720 * 1280 * 121f</td>
<td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-DF-14B-720P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-DF-14B-720P">ModelScope</a></td>
</tr>
<tr>
<td rowspan="5">Text-to-Video</td>
<td rowspan="5">Image-to-Video</td>
<td>1.3B-540P</td>
<td>544 * 960 * 97f</td>
<td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-1.3B-540P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-1.3B-540P">ModelScope</a></td>
</tr>
<tr>
<td>5B-540P</td>
<tr>
<td>14B-720P</td>
<td>720 * 1280 * 121f</td>
<td>🤗 <a href="https://huggingface.co/Skywork/SkyReels-V2-I2V-14B-720P">Huggingface</a> 🤖 <a href="https://www.modelscope.cn/models/Skywork/SkyReels-V2-I2V-14B-720P">ModelScope</a></td>
</tr>
<tr>
<td rowspan="3">Camera Director</td>

#### Single GPU Inference

- **Diffusion Forcing for Long Video Generation**

The <a href="https://arxiv.org/abs/2407.01392">**Diffusion Forcing**</a> version of the model allows us to generate infinite-length videos. This model supports both **text-to-video (T2V)** and **image-to-video (I2V)** tasks, and it can perform inference in both synchronous and asynchronous modes. Below are two example scripts for long video generation; if you want to adjust the inference parameters, e.g., the video duration or the inference mode, read the Note below first.

Synchronous generation of a 10s video:
```shell
model_id=Skywork/SkyReels-V2-DF-14B-540P
# synchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 0 \
  --base_num_frames 97 \
  --num_frames 257 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload \
  --teacache \
  --use_ret_steps \
  --teacache_thresh 0.3
```

Asynchronous generation of a 30s video:
```shell
model_id=Skywork/SkyReels-V2-DF-14B-540P
# asynchronous inference
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --ar_step 5 \
  --causal_block_size 5 \
  --base_num_frames 97 \
  --num_frames 737 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --offload
```

> **Note**:
> - If you want to run the **image-to-video (I2V)** task, add `--image ${image_path}` to your command; it is also better to use a **text-to-video (T2V)**-style prompt that includes some description of the first-frame image.
> - For long video generation, simply adjust `--num_frames`, e.g., `--num_frames 257` for a 10s video, `--num_frames 377` for 15s, `--num_frames 737` for 30s, `--num_frames 1457` for 60s. These numbers are not strictly aligned with the logical frame count for the specified duration, but they are aligned with some training parameters, which means they may perform better. When you use asynchronous inference with `causal_block_size > 1`, `--num_frames` should be set carefully.
> - You can use `--ar_step 5` to enable asynchronous inference. For asynchronous inference, `--causal_block_size 5` is recommended, while it should not be set for synchronous generation. REMEMBER that the number of frame latents fed into the model in each iteration, e.g., the base frame latent count ((97-1)//4+1=25 for base_num_frames=97) and the count for the last iteration ((237-97-(97-17)x1+17-1)//4+1=20 for base_num_frames=97, num_frames=237, overlap_history=17), MUST be divisible by causal_block_size. If these values are hard to calculate, just use our recommended settings above; a small arithmetic sketch follows this note. Asynchronous inference takes more steps to diffuse the whole sequence, which means it is SLOWER than synchronous mode. In our experiments, asynchronous inference may improve instruction following and visual consistency.
> - To reduce peak VRAM, lower `--base_num_frames`, e.g., to 77 or 57, while keeping the target generation length `--num_frames` unchanged. This may slightly reduce video quality, and it should not be set too small.
> - `--addnoise_condition` is used to help smooth long video generation by adding some noise to the clean condition. Too much noise can also cause inconsistency. 20 is a recommended value; you may try larger ones, but it is best not to exceed 50.
> - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM.
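
The divisibility rule above can be double-checked with plain shell arithmetic before launching a long run. The snippet below is only an illustrative sketch that reproduces the example numbers from the note; it is not a helper shipped with this repository.

```shell
# Illustrative check (not part of this repo): reproduce the note's arithmetic.
base_num_frames=97; num_frames=237; overlap_history=17; causal_block_size=5
# Base frame latent count: (97-1)//4+1 = 25
base_latents=$(( (base_num_frames - 1) / 4 + 1 ))
# Last-iteration latent count for the note's example: (237-97-(97-17)*1+17-1)//4+1 = 20
last_latents=$(( (num_frames - base_num_frames - (base_num_frames - overlap_history) * 1 + overlap_history - 1) / 4 + 1 ))
echo "base latent frames: ${base_latents}, divisible by causal_block_size: $(( base_latents % causal_block_size == 0 ))"
echo "last-iteration latent frames: ${last_latents}, divisible by causal_block_size: $(( last_latents % causal_block_size == 0 ))"
```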

- **Text To Video & Image To Video**

```shell
python3 generate_video.py \
  --shift 8.0 \
  --fps 24 \
  --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
  --offload \
  --teacache \
  --use_ret_steps \
  --teacache_thresh 0.3
```
> **Note**:
> - When using an **image-to-video (I2V)** model, you must provide an input image using the `--image ${image_path}` parameter. `--guidance_scale 5.0` and `--shift 3.0` are recommended for the I2V model; see the sketch below.
> - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 43.4GB peak VRAM.
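
For instance, a minimal I2V invocation along the lines of this note might look as follows. This is a sketch rather than an official example: the model id is taken from the table above, and the image path is a placeholder.

```shell
# Illustrative I2V sketch (not an official example); adjust the model id and image path as needed.
python3 generate_video.py \
  --model_id Skywork/SkyReels-V2-I2V-1.3B-540P \
  --resolution 540P \
  --image ./first_frame.png \
  --guidance_scale 5.0 \
  --shift 3.0 \
  --fps 24 \
  --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface." \
  --offload
```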


- **Prompt Enhancer**

The prompt enhancer is implemented based on <a href="https://huggingface.co/Qwen/Qwen2.5-32B-Instruct">Qwen2.5-32B-Instruct</a> and is enabled via the `--prompt_enhancer` parameter. It works best for short prompts; for long prompts, it might generate an excessively lengthy prompt that could lead to over-saturation in the generated video. Note that peak GPU memory exceeds 64GB when `--prompt_enhancer` is used. If you want to obtain the enhanced prompt separately, you can also run the prompt enhancer script on its own for testing. The steps are as follows:

```shell
cd skyreels_v2_infer/pipelines
python3 prompt_enhancer.py --prompt "A serene lake surrounded by towering mountains, with a few swans gracefully gliding across the water and sunlight dancing on the surface."
```
> **Note**:
> - `--prompt_enhancer` cannot be used together with `--use_usp`. We recommend running the `skyreels_v2_infer/pipelines/prompt_enhancer.py` script first to generate the enhanced prompt before enabling `--use_usp`; see the sketch below.
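
A minimal sketch of that two-step workflow is shown below; the enhanced prompt is pasted in by hand, and the model id is a placeholder rather than a prescribed value.

```shell
# Step 1: generate the enhanced prompt offline.
cd skyreels_v2_infer/pipelines
python3 prompt_enhancer.py --prompt "your original short prompt"
cd ../..

# Step 2: run multi-GPU inference with the enhanced text and without --prompt_enhancer.
model_id="<a T2V or I2V checkpoint from the table above>"   # placeholder
torchrun --nproc_per_node=2 generate_video.py \
  --model_id ${model_id} \
  --resolution 540P \
  --use_usp \
  --offload \
  --seed 42 \
  --prompt "<paste the enhanced prompt from step 1 here>"
```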


**Advanced Configuration Options**

Below are the key parameters you can customize for video generation:

| Parameter | Recommended Value | Description |
|-----------|-------------------|-------------|
| --offload | True | Offloads model components to CPU to reduce VRAM usage (recommended) |
| --use_usp | True | Enables multi-GPU acceleration with xDiT USP |
| --outdir | ./video_out | Directory where generated videos will be saved |
| --prompt_enhancer | True | Expands the prompt into a more detailed description |
| --teacache | False | Enables teacache for faster inference |
| --teacache_thresh | 0.2 | Higher values give more speedup at the cost of quality |
| --use_ret_steps | False | Uses retention steps for teacache |

**Diffusion Forcing Additional Parameters**
| Parameter | Recommended Value | Description |

```shell
torchrun --nproc_per_node=2 generate_video_df.py \
  --base_num_frames 97 \
  --num_frames 257 \
  --overlap_history 17 \
  --prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
  --addnoise_condition 20 \
  --use_usp \
  --offload \
  --seed 42
```

```shell
torchrun --nproc_per_node=2 generate_video.py \
  --seed 42
```
> **Note**:
> - When using an **image-to-video (I2V)** model, you must provide an input image using the `--image ${image_path}` parameter. `--guidance_scale 5.0` and `--shift 3.0` are recommended for the I2V model.


## Contents
- [Video Captioner](#video-captioner)
- [Reinforcement Learning](#reinforcement-learning)
- [Diffusion Forcing](#diffusion-forcing)
- [High-Quality Supervised Fine-Tuning (SFT)](#high-quality-supervised-fine-tuning-sft)
- [Performance](#performance)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)

- the generative model does not handle large, deformable motions well.
- the generated videos may violate physical laws.

To avoid degradation in other metrics, such as text alignment and video quality, we ensure the preference data pairs have comparable text alignment and video quality, while only the motion quality varies. This requirement poses greater challenges in obtaining preference annotations due to the inherently higher costs of human annotation. To address this challenge, we propose a semi-automatic pipeline that strategically combines automatically generated motion pairs and human annotation results. This hybrid approach not only enhances the data scale but also improves alignment with human preferences through curated quality control. Leveraging this enhanced dataset, we first train a specialized reward model to capture the generic motion quality differences between paired samples. This learned reward function subsequently guides the sample selection process for Direct Preference Optimization (DPO), enhancing the motion quality of the generative model.
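
For reference, DPO optimizes the policy directly on such preference pairs; the generic objective over a preferred/rejected pair $(y_w, y_l)$ for a prompt $x$ is shown below. This is the standard formulation from the DPO literature, not necessarily the exact loss used for SkyReels-V2.

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls how strongly the preference constraint is enforced.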

#### Diffusion Forcing

```bibtex
@misc{chen2025skyreelsv2infinitelengthfilmgenerative,
      title={SkyReels-V2: Infinite-length Film Generative Model},
      author={Guibin Chen and Dixuan Lin and Jiangping Yang and Chunze Lin and Junchen Zhu and Mingyuan Fan and Hao Zhang and Sheng Chen and Zheng Chen and Chengcheng Ma and Weiming Xiong and Wei Wang and Nuo Pang and Kang Kang and Zhiheng Xu and Yuzhe Jin and Yupeng Liang and Yubing Song and Peng Zhao and Boyuan Xu and Di Qiu and Debang Li and Zhengcong Fei and Yang Li and Yahui Zhou},
      year={2025},
      eprint={2504.13074},
      archivePrefix={arXiv},
}
```