WanX-Video-1 committed
Commit a41139f · 1 Parent(s): c55421c

init upload
.gitattributes CHANGED
@@ -37,7 +37,6 @@ google/umt5-xxl/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 assets/comp_effic.png filter=lfs diff=lfs merge=lfs -text
 assets/data_for_diff_stage.jpg filter=lfs diff=lfs merge=lfs -text
 assets/i2v_res.png filter=lfs diff=lfs merge=lfs -text
-assets/input.png filter=lfs diff=lfs merge=lfs -text
 assets/logo.png filter=lfs diff=lfs merge=lfs -text
 assets/t2v_res.jpg filter=lfs diff=lfs merge=lfs -text
 assets/vben_vs_sota.png filter=lfs diff=lfs merge=lfs -text
@@ -45,5 +44,4 @@ assets/vben_vs_sota_t2i.jpg filter=lfs diff=lfs merge=lfs -text
 assets/video_dit_arch.jpg filter=lfs diff=lfs merge=lfs -text
 assets/video_vae_res.jpg filter=lfs diff=lfs merge=lfs -text
 examples/i2v_input.JPG filter=lfs diff=lfs merge=lfs -text
-assets/.DS_Store filter=lfs diff=lfs merge=lfs -text
 assets/vben_1.3b_vs_sota.png filter=lfs diff=lfs merge=lfs -text
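Each rule above is a Git LFS tracking entry; this commit simply drops the entries for the two files it deletes. As a sketch of standard `git-lfs` usage (not something recorded in this commit), such entries are typically managed with:

```
# Appends "assets/input.png filter=lfs diff=lfs merge=lfs -text" to .gitattributes
git lfs track "assets/input.png"

# Removes that rule again, matching the .gitattributes deletions in this diff
git lfs untrack "assets/input.png"
```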
 
README.md CHANGED
@@ -5,12 +5,12 @@
 <p>
 
 <p align="center">
-    💜 <a href=""><b>Wan</b></a> &nbsp&nbsp ｜ &nbsp&nbsp 🖥️ <a href="https://github.com/Wan-Video/Wan2.1">GitHub</a> &nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Wan-AI/">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/Wan-AI">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="">Paper</a> &nbsp&nbsp | &nbsp&nbsp 📑 <a href="">Blog</a> &nbsp&nbsp | &nbsp&nbsp💬 <a href="">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp 📖 <a href="https://discord.gg/p5XbdQV7">Discord</a>&nbsp&nbsp
+    💜 <a href=""><b>Wan</b></a> &nbsp&nbsp ｜ &nbsp&nbsp 🖥️ <a href="https://github.com/Wan-Video/Wan2.1">GitHub</a> &nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Wan-AI/">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/Wan-AI">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="">Paper (Coming soon)</a> &nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://wanxai.com">Blog</a> &nbsp&nbsp | &nbsp&nbsp💬 <a href="https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg">WeChat Group</a>&nbsp&nbsp | &nbsp&nbsp 📖 <a href="https://discord.gg/p5XbdQV7">Discord</a>&nbsp&nbsp
     <br>
 
 -----
 
-[**Wan: Open and Advanced Large-Scale Video Generative Models**]("#") <br>
+[**Wan: Open and Advanced Large-Scale Video Generative Models**]("") <br>
 
 In this repository, we present **Wan2.1**, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. **Wan2.1** offers these key features:
 - 👍 **SOTA Performance**: **Wan2.1** consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
@@ -19,7 +19,6 @@ In this repository, we present **Wan2.1**, a comprehensive and open suite of vid
 - 👍 **Visual Text Generation**: **Wan2.1** is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
 - 👍 **Powerful Video VAE**: **Wan-VAE** delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.
 
-
 This repository features our T2V-14B model, which establishes a new SOTA performance benchmark among both open-source and closed-source models. It demonstrates exceptional capabilities in generating high-quality visuals with significant motion dynamics. It is also the only video model capable of producing both Chinese and English text and supports video generation at both 480P and 720P resolutions.
 
 
@@ -72,10 +71,10 @@ pip install -r requirements.txt
 
 | Models        | Download Link                                                                   | Notes                         |
 | --------------|---------------------------------------------------------------------------------|-------------------------------|
-| T2V-14B       | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)                      | Supports both 480P and 720P
-| I2V-14B-720P  | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P)                 | Supports 720P
-| I2V-14B-480P  | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P)                 | Supports 480P
-| T2V-1.3B      | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)                     | Supports 480P
+| T2V-14B       | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Supports both 480P and 720P
+| I2V-14B-720P  | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P
+| I2V-14B-480P  | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P
+| T2V-1.3B      | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) | Supports 480P
 
 > 💡Note: The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution.
 
@@ -83,7 +82,7 @@ pip install -r requirements.txt
 Download models using huggingface-cli:
 ```
 pip install "huggingface_hub[cli]"
-huggingface-cli download --resume-download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
+huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
 ```
 
 #### Run Text-to-Video Generation
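Since the model table above now links ModelScope mirrors as well, a download sketch via the ModelScope CLI may be useful here; this assumes the `modelscope` package and its `download` subcommand with a `--local_dir` flag, which are not part of this commit:

```
# Install the ModelScope client, then fetch the same checkpoint from the mirror
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B --local_dir ./Wan2.1-T2V-14B
```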
@@ -135,7 +134,8 @@ If you encounter OOM (Out-of-Memory) issues, you can use the `--offload_model Tr
 python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
 ```
 
-> 💡Note: If you use `T2V-1.3B` model, we recommend use parameter `--sample_shift 8 --sample_guide_scale 6`
+> 💡Note: If you are using the `T2V-1.3B` model, we recommend setting the parameter `--sample_guide_scale 6`. The `--sample_shift` parameter can be adjusted within the range of 8 to 12 based on performance.
+
 
 - Multi-GPU inference using FSDP + xDiT USP
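To make the revised note concrete, here is a hypothetical invocation that follows it: the same 1.3B command as above with `--sample_guide_scale 6` kept fixed and `--sample_shift` moved to a mid-range value (10 is illustrative; the note only says to stay within 8 to 12):

```
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 10 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```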
 
@@ -150,8 +150,8 @@ torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 --ckpt_di
 Extending the prompts can effectively enrich the details in the generated videos, further enhancing the video quality. Therefore, we recommend enabling prompt extension. We provide the following two methods for prompt extension:
 
 - Use the Dashscope API for extension.
-  - Apply for a `dashscope.api_key` in advance ([Application Link](https://help.aliyun.com/zh/dashscope/developer-reference/qwen-api)).
-  - Configure the environment variable `DASH_API_KEY` to specify the Dashscope API key. For users of Alibaba Cloud's international site, you also need to set the environment variable `DASH_API_URL` to 'https://dashscope-intl.aliyuncs.com/api/v1'. For more detailed instructions, please refer to the [dashscope document](https://www.alibabacloud.com/help/en/model-studio/user-guide/vision/?spm=a2c63.p38356.help-menu-2400256.d_1_0_1.50ea5f94ROV2Ar).
+  - Apply for a `dashscope.api_key` in advance ([EN](https://www.alibabacloud.com/help/en/model-studio/getting-started/first-api-call-to-qwen) | [CN](https://help.aliyun.com/zh/model-studio/getting-started/first-api-call-to-qwen)).
+  - Configure the environment variable `DASH_API_KEY` to specify the Dashscope API key. For users of Alibaba Cloud's international site, you also need to set the environment variable `DASH_API_URL` to 'https://dashscope-intl.aliyuncs.com/api/v1'. For more detailed instructions, please refer to the [dashscope document](https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api?spm=a2c63.p38356.0.i1).
   - Use the `qwen-plus` model for text-to-video tasks and `qwen-vl-max` for image-to-video tasks.
   - You can modify the model used for extension with the parameter `--prompt_extend_model`. For example:
 ```
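The Dashscope setup described in the updated bullets amounts to exporting two environment variables before running `generate.py`; a minimal sketch, with the key value as a placeholder:

```
export DASH_API_KEY=sk-xxxxxxxx   # placeholder; use your own Dashscope API key
# Only needed for users of Alibaba Cloud's international site:
export DASH_API_URL=https://dashscope-intl.aliyuncs.com/api/v1
```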
@@ -208,6 +208,10 @@ We test the computational efficiency of different **Wan2.1** models on different
 > (3) For the 1.3B model on a single 4090 GPU, set `--offload_model True --t5_cpu`;
 > (4) For all testings, no prompt extension was applied, meaning `--use_prompt_extend` was not enabled.
 
+
+## Community Contributions
+- [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) provides more support for Wan, including video-to-video, FP8 quantization, VRAM optimization, LoRA training, and more. Please refer to [their examples](https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/wanvideo).
+
 -------
 
 ## Introduction of Wan2.1
@@ -248,7 +252,7 @@ We curated and deduplicated a candidate dataset comprising a vast amount of imag
 
 
 ##### Comparisons to SOTA
-We compared **Wan2.1** with leading open-source and closed-source models to evaluate its performance. Using our carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. Then we calculated the total score through a weighted average based on the importance of each dimension. The detailed results are shown in the table below. These results demonstrate our model's superior performance compared to both open-source and closed-source models.
+We compared **Wan2.1** with leading open-source and closed-source models to evaluate its performance. Using our carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed the total score through a weighted calculation over the per-dimension scores, using weights derived from human preferences in the matching process. The detailed results are shown in the table below. These results demonstrate our model's superior performance compared to both open-source and closed-source models.
 
 ![figure1](assets/vben_vs_sota.png "figure1")
 
@@ -271,9 +275,9 @@ The models in this repository are licensed under the Apache 2.0 License. We clai
 
 ## Acknowledgements
 
-We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [QWen](https://huggingface.co/Qwen), [umt5-xxl](https://huggingface.co/google/umt5-xxl), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.
+We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Qwen](https://huggingface.co/Qwen), [umt5-xxl](https://huggingface.co/google/umt5-xxl), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research.
 
 
 
 ## Contact Us
-If you would like to leave a message for our research or product teams, feel free to join our [Discord](https://discord.gg/p5XbdQV7) or [WeChat groups]()!
+If you would like to leave a message for our research or product teams, feel free to join our [Discord](https://discord.gg/p5XbdQV7) or [WeChat groups](https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg)!
 
assets/.DS_Store DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d65165279105ca6773180500688df4bdc69a2c7b771752f0a46ef120b7fd8ec3
-size 6148
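The three deleted lines above are a Git LFS pointer, not the binary itself: LFS stores this small stub in Git while the actual content lives in LFS storage. As a hedged aside (standard `git-lfs`, not shown in this commit), an equivalent pointer can be generated locally with:

```
# Prints the LFS pointer (version, oid, size) for a local file
git lfs pointer --file=assets/.DS_Store
```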
 
 
 
 
assets/comp_effic.png CHANGED

Git LFS Details (before)
  • SHA256: b1b23457157a494ebe834306962e927768830e26a2d51b896929d2d7cba54dd6
  • Pointer size: 132 Bytes
  • Size of remote file: 1.6 MB

Git LFS Details (after)
  • SHA256: b0e225caffb4b31295ad150f95ee852e4c3dde4a00ac8f79a2ff500f2ce26b8d
  • Pointer size: 132 Bytes
  • Size of remote file: 1.79 MB
assets/input.png DELETED

Git LFS Details (deleted file)
  • SHA256: da5825447ffdefe9728c0e99caf7724a258c79d9afc0e4ec47421f16b4bc27b7
  • Pointer size: 132 Bytes
  • Size of remote file: 1.07 MB
assets/vben_vs_sota.png CHANGED

Git LFS Details (before)
  • SHA256: d32d27b128f46b6d3abe3cdaec4966629fcb86ae7658679ed1c985eec8541c4b
  • Pointer size: 131 Bytes
  • Size of remote file: 584 kB

Git LFS Details (after)
  • SHA256: 9a0e86ca85046d2675f97984b88b6e74df07bba8a62a31ab8a1aef50d4eda44e
  • Pointer size: 132 Bytes
  • Size of remote file: 1.55 MB
assets/video_vae_res.jpg CHANGED

Git LFS Details (before)
  • SHA256: 4e98374a200c3a0b3a4d1322d4d3dfe33ff62019812a6338c947cfd21efbfc5f
  • Pointer size: 131 Bytes
  • Size of remote file: 212 kB

Git LFS Details (after)
  • SHA256: d8f9e7f7353848056a615c8ef35ab86ec22976bb46cb27405008b4089701945c
  • Pointer size: 131 Bytes
  • Size of remote file: 213 kB