WanX-Video-1 committed
Commit a41139f · 1 Parent(s): c55421c

init upload
.gitattributes CHANGED
@@ -37,7 +37,6 @@ google/umt5-xxl/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 assets/comp_effic.png filter=lfs diff=lfs merge=lfs -text
 assets/data_for_diff_stage.jpg filter=lfs diff=lfs merge=lfs -text
 assets/i2v_res.png filter=lfs diff=lfs merge=lfs -text
-assets/input.png filter=lfs diff=lfs merge=lfs -text
 assets/logo.png filter=lfs diff=lfs merge=lfs -text
 assets/t2v_res.jpg filter=lfs diff=lfs merge=lfs -text
 assets/vben_vs_sota.png filter=lfs diff=lfs merge=lfs -text
@@ -45,5 +44,4 @@ assets/vben_vs_sota_t2i.jpg filter=lfs diff=lfs merge=lfs -text
 assets/video_dit_arch.jpg filter=lfs diff=lfs merge=lfs -text
 assets/video_vae_res.jpg filter=lfs diff=lfs merge=lfs -text
 examples/i2v_input.JPG filter=lfs diff=lfs merge=lfs -text
-assets/.DS_Store filter=lfs diff=lfs merge=lfs -text
 assets/vben_1.3b_vs_sota.png filter=lfs diff=lfs merge=lfs -text
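Each rule above is a Git LFS tracking entry; this commit simply drops the entries for the two files it deletes. As a sketch of standard `git-lfs` usage (not something recorded in this commit), such entries are typically managed with:

```
# Appends "assets/input.png filter=lfs diff=lfs merge=lfs -text" to .gitattributes
git lfs track "assets/input.png"

# Removes that rule again, matching the .gitattributes deletions in this diff
git lfs untrack "assets/input.png"
```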
 
README.md CHANGED
@@ -5,12 +5,12 @@
 <p>
 
 <p align="center">
-    💜 <a href=""><b>Wan</b></a> &nbsp&nbsp ｜ &nbsp&nbsp 🖥️ <a href="https://github.com/Wan-Video/Wan2.1">GitHub</a> &nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Wan-AI/">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/Wan-AI">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="">Paper</a> &nbsp&nbsp | &nbsp&nbsp 📑 <a href="">Blog</a> &nbsp&nbsp | &nbsp&nbsp💬 <a href="">WeChat (微信)</a>&nbsp&nbsp | &nbsp&nbsp 📖 <a href="https://discord.gg/p5XbdQV7">Discord</a>&nbsp&nbsp
+    💜 <a href=""><b>Wan</b></a> &nbsp&nbsp ｜ &nbsp&nbsp 🖥️ <a href="https://github.com/Wan-Video/Wan2.1">GitHub</a> &nbsp&nbsp | &nbsp&nbsp🤗 <a href="https://huggingface.co/Wan-AI/">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/Wan-AI">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="">Paper (Coming soon)</a> &nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://wanxai.com">Blog</a> &nbsp&nbsp | &nbsp&nbsp💬 <a href="https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg">WeChat Group</a>&nbsp&nbsp | &nbsp&nbsp 📖 <a href="https://discord.gg/p5XbdQV7">Discord</a>&nbsp&nbsp
     <br>
 
 -----
 
-[**Wan: Open and Advanced Large-Scale Video Generative Models**]("#") <br>
+[**Wan: Open and Advanced Large-Scale Video Generative Models**]("") <br>
 
 In this repository, we present **Wan2.1**, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. **Wan2.1** offers these key features:
 - 👍 **SOTA Performance**: **Wan2.1** consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
@@ -19,7 +19,6 @@ In this repository, we present **Wan2.1**, a comprehensive and open suite of vid
 - 👍 **Visual Text Generation**: **Wan2.1** is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
 - 👍 **Powerful Video VAE**: **Wan-VAE** delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.
 
-
 This repository features our T2V-14B model, which establishes a new SOTA performance benchmark among both open-source and closed-source models. It demonstrates exceptional capabilities in generating high-quality visuals with significant motion dynamics. It is also the only video model capable of producing both Chinese and English text and supports video generation at both 480P and 720P resolutions.
 
 
@@ -72,10 +71,10 @@ pip install -r requirements.txt
 
 | Models        | Download Link                                                                   | Notes                         |
 | --------------|---------------------------------------------------------------------------------|-------------------------------|
-| T2V-14B       | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)                      | Supports both 480P and 720P
-| I2V-14B-720P  | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P)                 | Supports 720P
-| I2V-14B-480P  | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P)                 | Supports 480P
-| T2V-1.3B      | [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)                     | Supports 480P
+| T2V-14B       | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Supports both 480P and 720P
+| I2V-14B-720P  | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P
+| I2V-14B-480P  | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P
+| T2V-1.3B      | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) | Supports 480P
 
 > 💡Note: The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution.
 
@@ -83,7 +82,7 @@ pip install -r requirements.txt
 Download models using huggingface-cli:
 ```
 pip install "huggingface_hub[cli]"
-huggingface-cli download --resume-download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
+huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
 ```
 
 #### Run Text-to-Video Generation
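Since the model table above now links ModelScope mirrors as well, a download sketch via the ModelScope CLI may be useful here; this assumes the `modelscope` package and its `download` subcommand with a `--local_dir` flag, which are not part of this commit:

```
# Install the ModelScope client, then fetch the same checkpoint from the mirror
pip install modelscope
modelscope download Wan-AI/Wan2.1-T2V-14B --local_dir ./Wan2.1-T2V-14B
```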
@@ -135,7 +134,8 @@ If you encounter OOM (Out-of-Memory) issues, you can use the `--offload_model Tr
 python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 8 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
 ```
 
-> 💡Note: If you use `T2V-1.3B` model, we recommend use parameter `--sample_shift 8 --sample_guide_scale 6`
+> 💡Note: If you are using the `T2V-1.3B` model, we recommend setting the parameter `--sample_guide_scale 6`. The `--sample_shift` parameter can be adjusted within the range of 8 to 12 based on performance.
+
 
 - Multi-GPU inference using FSDP + xDiT USP
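To make the revised note concrete, here is a hypothetical invocation that follows it: the same 1.3B command as above with `--sample_guide_scale 6` kept fixed and `--sample_shift` moved to a mid-range value (10 is illustrative; the note only says to stay within 8 to 12):

```
python generate.py --task t2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --offload_model True --t5_cpu --sample_shift 10 --sample_guide_scale 6 --prompt "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
```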
 
@@ -150,8 +150,8 @@ torchrun --nproc_per_node=8 generate.py --task t2v-14B --size 1280*720 --ckpt_di
 Extending the prompts can effectively enrich the details in the generated videos, further enhancing the video quality. Therefore, we recommend enabling prompt extension. We provide the following two methods for prompt extension:
 
 - Use the Dashscope API for extension.
-  - Apply for a `dashscope.api_key` in advance ([Application Link](https://help.aliyun.com/zh/dashscope/developer-reference/qwen-api)).
-  - Configure the environment variable `DASH_API_KEY` to specify the Dashscope API key. For users of Alibaba Cloud's international site, you also need to set the environment variable `DASH_API_URL` to 'https://dashscope-intl.aliyuncs.com/api/v1'. For more detailed instructions, please refer to the [dashscope document](https://www.alibabacloud.com/help/en/model-studio/user-guide/vision/?spm=a2c63.p38356.help-menu-2400256.d_1_0_1.50ea5f94ROV2Ar).
+  - Apply for a `dashscope.api_key` in advance ([EN](https://www.alibabacloud.com/help/en/model-studio/getting-started/first-api-call-to-qwen) | [CN](https://help.aliyun.com/zh/model-studio/getting-started/first-api-call-to-qwen)).
+  - Configure the environment variable `DASH_API_KEY` to specify the Dashscope API key. For users of Alibaba Cloud's international site, you also need to set the environment variable `DASH_API_URL` to 'https://dashscope-intl.aliyuncs.com/api/v1'. For more detailed instructions, please refer to the [dashscope document](https://www.alibabacloud.com/help/en/model-studio/developer-reference/use-qwen-by-calling-api?spm=a2c63.p38356.0.i1).
   - Use the `qwen-plus` model for text-to-video tasks and `qwen-vl-max` for image-to-video tasks.
   - You can modify the model used for extension with the parameter `--prompt_extend_model`. For example:
 ```
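The Dashscope setup described in the updated bullets amounts to exporting two environment variables before running `generate.py`; a minimal sketch, with the key value as a placeholder:

```
export DASH_API_KEY=sk-xxxxxxxx   # placeholder; use your own Dashscope API key
# Only needed for users of Alibaba Cloud's international site:
export DASH_API_URL=https://dashscope-intl.aliyuncs.com/api/v1
```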
@@ -208,6 +208,10 @@ We test the computational efficiency of different **Wan2.1** models on different
 > (3) For the 1.3B model on a single 4090 GPU, set `--offload_model True --t5_cpu`;
 > (4) For all testings, no prompt extension was applied, meaning `--use_prompt_extend` was not enabled.
 
+
+## Community Contributions
+- [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) provides more support for Wan, including video-to-video, FP8 quantization, VRAM optimization, LoRA training, and more. Please refer to [their examples](https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/wanvideo).
+
 -------
 
 ## Introduction of Wan2.1
@@ -248,7 +252,7 @@ We curated and deduplicated a candidate dataset comprising a vast amount of imag
 
 
 ##### Comparisons to SOTA
-We compared **Wan2.1** with leading open-source and closed-source models to evaluate its performance. Using our carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. Then we calculated the total score through a weighted average based on the importance of each dimension. The detailed results are shown in the table below. These results demonstrate our model's superior performance compared to both open-source and closed-source models.
+We compared **Wan2.1** with leading open-source and closed-source models to evaluate its performance. Using our carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed the total score through a weighted calculation over the per-dimension scores, using weights derived from human preferences in the matching process. The detailed results are shown in the table below. These results demonstrate our model's superior performance compared to both open-source and closed-source models.
 
 ![figure1](assets/vben_vs_sota.png "figure1")
 
@@ -271,9 +275,9 @@ The models in this repository are licensed under the Apache 2.0 License. We clai
 
 ## Acknowledgements
 
-We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [QWen](https://huggingface.co/Qwen), [umt5-xxl](https://huggingface.co/google/umt5-xxl), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.
+We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Qwen](https://huggingface.co/Qwen), [umt5-xxl](https://huggingface.co/google/umt5-xxl), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research.
 
 
 
 ## Contact Us
-If you would like to leave a message for our research or product teams, feel free to join our [Discord](https://discord.gg/p5XbdQV7) or [WeChat groups]()!
+If you would like to leave a message for our research or product teams, feel free to join our [Discord](https://discord.gg/p5XbdQV7) or [WeChat groups](https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg)!
 
assets/.DS_Store DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d65165279105ca6773180500688df4bdc69a2c7b771752f0a46ef120b7fd8ec3
-size 6148
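The three deleted lines above are a Git LFS pointer, not the binary itself: LFS stores this small stub in Git while the actual content lives in LFS storage. As a hedged aside (standard `git-lfs`, not shown in this commit), an equivalent pointer can be generated locally with:

```
# Prints the LFS pointer (version, oid, size) for a local file
git lfs pointer --file=assets/.DS_Store
```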
 
 
 
 
assets/comp_effic.png CHANGED

Git LFS Details (before)
  • SHA256: b1b23457157a494ebe834306962e927768830e26a2d51b896929d2d7cba54dd6
  • Pointer size: 132 Bytes
  • Size of remote file: 1.6 MB

Git LFS Details (after)
  • SHA256: b0e225caffb4b31295ad150f95ee852e4c3dde4a00ac8f79a2ff500f2ce26b8d
  • Pointer size: 132 Bytes
  • Size of remote file: 1.79 MB
assets/input.png DELETED

Git LFS Details (deleted file)
  • SHA256: da5825447ffdefe9728c0e99caf7724a258c79d9afc0e4ec47421f16b4bc27b7
  • Pointer size: 132 Bytes
  • Size of remote file: 1.07 MB
assets/vben_vs_sota.png CHANGED

Git LFS Details (before)
  • SHA256: d32d27b128f46b6d3abe3cdaec4966629fcb86ae7658679ed1c985eec8541c4b
  • Pointer size: 131 Bytes
  • Size of remote file: 584 kB

Git LFS Details (after)
  • SHA256: 9a0e86ca85046d2675f97984b88b6e74df07bba8a62a31ab8a1aef50d4eda44e
  • Pointer size: 132 Bytes
  • Size of remote file: 1.55 MB
assets/video_vae_res.jpg CHANGED

Git LFS Details (before)
  • SHA256: 4e98374a200c3a0b3a4d1322d4d3dfe33ff62019812a6338c947cfd21efbfc5f
  • Pointer size: 131 Bytes
  • Size of remote file: 212 kB

Git LFS Details (after)
  • SHA256: d8f9e7f7353848056a615c8ef35ab86ec22976bb46cb27405008b4089701945c
  • Pointer size: 131 Bytes
  • Size of remote file: 213 kB