Improve model card: Add metadata, links, and usage example

#4
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +202 -38
README.md CHANGED
 @@ -1,6 +1,16 @@
  # Pusa VidGen
 
- [Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://arxiv.org/abs/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)
 
  ## Overview
 
@@ -8,34 +18,34 @@ Pusa introduces a paradigm shift in video diffusion modeling through frame-level
 
  ## ✨ Key Features
 
- - **Comprehensive Multi-task Support**:
-   - Text-to-Video generation
-   - Image-to-Video transformation
-   - Frame interpolation
-   - Video transitions
-   - Seamless looping
-   - Extended video generation
-   - And more...
-
- - **Unprecedented Efficiency**:
-   - Trained with only 0.1k H800 GPU hours
-   - Total training cost: $0.1k
-   - Hardware: 16 H800 GPUs
-   - Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate
-   - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*
-
- - **Complete Open-Source Release**:
-   - Full codebase
-   - Detailed architecture specifications
-   - Comprehensive training methodology
 
  ## 🔍 Unique Architecture
 
- - **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability.
 
- - **Non-destructive Modification**: Our adaptations to the base model preserve its original Text-to-Video generation capabilities. After this adaptation, we only need a slight fine-tuning.
 
- - **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
 
  ## Installation and Usage
 
@@ -49,17 +59,171 @@ huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_di
 
  **Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
 
  ## Limitations
 
  Pusa currently has several known limitations:
- - The base Mochi model generates videos at relatively low resolution (480p)
- - We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- - We welcome community contributions to enhance model performance and extend its capabilities
 
  ## Related Work
 
- - [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
- - [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
 
  ## Citation
 
@@ -67,18 +231,18 @@ If you find our work useful in your research, please consider citing:
 
  ```
  @misc{Liu2025pusa,
-   title={Pusa: Thousands Timesteps Video Diffusion Model},
-   author={Yaofang Liu and Rui Liu},
-   year={2025},
-   url={https://github.com/Yaofang-Liu/Pusa-VidGen},
  }
  ```
 
  ```
  @article{liu2024redefining,
-   title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
-   author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
-   journal={arXiv preprint arXiv:2410.03160},
-   year={2024}
  }
- ```
 
+ ---
+ pipeline_tag: image-to-video
+ library_name: diffusers
+ license: apache-2.0
+ tags:
+ - text-to-video
+ - diffusion-models
+ - video-generation
+ ---
+
  # Pusa VidGen
 
+ [Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Project Page](https://yaofang-liu.github.io/Pusa_Web/) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://huggingface.co/papers/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)
 
  ## Overview
 
  ## ✨ Key Features
 
+ - **Comprehensive Multi-task Support**:
+   - Text-to-Video generation
+   - Image-to-Video transformation
+   - Frame interpolation
+   - Video transitions
+   - Seamless looping
+   - Extended video generation
+   - And more...
+
+ - **Unprecedented Efficiency**:
+   - Trained with only 0.1k H800 GPU hours
+   - Total training cost: $0.1k
+   - Hardware: 16 H800 GPUs
+   - Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate (see the sketch after this list)
+   - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*
+
+ - **Complete Open-Source Release**:
+   - Full codebase
+   - Detailed architecture specifications
+   - Comprehensive training methodology
 
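For readers who want the reported training setup at a glance, here is a small, hypothetical sketch that simply transcribes the numbers listed above into a Python dict; the Mochi-Full-Finetuner toolkit defines its own configuration format, so treat the key names as illustrative only.

```python
# Hypothetical transcription of the training setup reported above.
# Key names are illustrative; this is not the training toolkit's real schema.
pusa_v05_training_setup = {
    "base_model": "genmo/mochi-1-preview",  # foundation model listed under Related Work
    "hardware": "16x H800",
    "gpu_hours": 100,              # "0.1k H800 GPU hours"
    "global_batch_size": 32,
    "training_iterations": 500,
    "learning_rate": 1e-5,
}
```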
  ## 🔍 Unique Architecture
 
+ - **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (see the illustrative sketch after this list).
 
+ - **Non-destructive Modification**: Our adaptations to the base model preserve its original Text-to-Video generation capabilities; after this adaptation, only light fine-tuning is needed.
 
+ - **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
 
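To make the frame-level noise control idea concrete, here is a minimal, self-contained sketch in plain PyTorch (not Pusa's actual API; all shapes and names are hypothetical): instead of one scalar timestep shared by every frame, each frame gets its own entry in a timestep vector, so a conditioning frame can stay clean while the remaining frames are fully noised.

```python
import torch

# Illustrative sketch of a vectorized timestep; shapes and names are hypothetical.
num_frames, channels, height, width = 16, 4, 60, 106
latents = torch.randn(num_frames, channels, height, width)
noise = torch.randn_like(latents)

# Conventional video diffusion: a single scalar timestep for the whole clip.
scalar_timestep = torch.tensor(999)

# Frame-level noise control: one timestep per frame
# (frame 0 kept clean, e.g. as image conditioning).
frame_timesteps = torch.full((num_frames,), 999)
frame_timesteps[0] = 0

# Apply per-frame noise levels under a simple linear (flow-matching-style) schedule.
sigmas = (frame_timesteps.float() / 1000.0).view(-1, 1, 1, 1)  # [F, 1, 1, 1]
noisy_latents = (1.0 - sigmas) * latents + sigmas * noise

print(frame_timesteps[:4])    # tensor([  0, 999, 999, 999])
print(noisy_latents.shape)    # torch.Size([16, 4, 60, 106])
```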
  ## Installation and Usage
 
 
  **Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
 
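If you prefer a programmatic download instead of the CLI or browser options above, the `huggingface_hub` library's `snapshot_download` can fetch the repository as well (shown here as an optional alternative; the target directory is just an example):

```python
from huggingface_hub import snapshot_download

# Download the full Pusa-V0.5 repository to a local directory of your choice.
local_dir = snapshot_download(repo_id="RaphaelLiu/Pusa-V0.5", local_dir="./Pusa-V0.5")
print(f"Model files downloaded to: {local_dir}")
```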
+ ### Sample Usage: Image-to-Video Generation
+
+ To generate a video from an image input, you can use the Python example below, which performs image-to-video generation with Pusa.
+
+ First, ensure you have the necessary libraries installed:
+ ```bash
+ pip install torch transformers diffusers pillow imageio numpy torchvision
+ pip install uv # For installing genmo models
+ # Then, navigate to a directory where you want to clone the genmo models
+ git clone https://github.com/genmoai/models
+ cd models
+ uv venv .venv
+ source .venv/bin/activate
+ uv pip install setuptools
+ uv pip install -e . --no-build-isolation
+ # If you want flash attention:
+ # uv pip install -e .[flash] --no-build-isolation
+ ```
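Before running the generation script, an optional sanity check confirms that PyTorch can see a CUDA device, since the pipeline below is moved to `"cuda"`:

```python
import torch

# Optional: verify the environment before running the full generation script.
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```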
+
+ Now, you can run the following Python script. Remember to replace the placeholder model path with the actual local path where you downloaded the model weights.
+
+ ```python
+ import os
+ import numpy as np
+ import torch
+ import torchvision.transforms as T
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ from transformers import AutoModel, AutoTokenizer
+ from diffusers import FlowMatchEulerDiscreteScheduler
+ from safetensors.torch import load_file
+ from genmo.mochi_preview.pipelines import MochiPipeline
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+ # Optional tiling/preprocessing helpers (not used by the minimal example below).
+ def build_transform(input_size):
+     MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     if use_thumbnail and len(processed_images) != 1:
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
+
+ def load_image(image_file, input_size=448, max_num=12):
+     image = Image.open(image_file).convert('RGB')
+     transform = build_transform(input_size=input_size)
+     images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+     pixel_values = [transform(image) for image in images]
+     pixel_values = torch.stack(pixel_values)
+     return pixel_values
+
+ # Load the pipeline
+ # The path below assumes Pusa-V0.5 is downloaded to a directory named Pusa-V0.5
+ # relative to where you run the script, or provide the full path.
+ pipeline = MochiPipeline.from_pretrained(
+     "RaphaelLiu/Pusa-V0.5",  # or "/path/to/Pusa-V0.5"
+     torch_dtype=torch.float16,
+ )
+ pipeline.to("cuda")  # Ensure pipeline is moved to GPU
+
+ # Load the additional DIT weights for Pusa
+ # Make sure the path to pusa_v0_dit.safetensors is correct;
+ # .safetensors files are loaded with safetensors, not torch.load.
+ dit_weights_path = "./Pusa-V0.5/pusa_v0_dit.safetensors"  # Adjust if your download path is different
+ pipeline.transformer.load_state_dict(load_file(dit_weights_path), strict=False)
+
+ # Example parameters for generation
+ prompt = "The camera remains still, the man is surfing on a wave with his surfboard."
+ # Create a dummy image for demonstration if actual image is not present
+ # In a real scenario, replace with a path to a .jpg image file
+ try:
+     image = Image.open("./demos/example.jpg").convert('RGB')  # Assumes running from Pusa-VidGen root
+ except FileNotFoundError:
+     print("Example image not found. Creating a dummy image for demonstration.")
+     image = Image.new('RGB', (512, 512), color='red')
+     # Save the dummy image for use by the script
+     os.makedirs("./demos", exist_ok=True)
+     image.save("./demos/example.jpg")
+ image_path = "./demos/example.jpg"
+
+ cond_position = 0
+ num_steps = 30
+ noise_multiplier = 0.4
+
+ # Load and preprocess the image (using feature_extractor from the pipeline)
+ image_tensor = pipeline.feature_extractor.preprocess(image, return_tensors="pt").pixel_values
+ image_tensor = image_tensor.to(pipeline.device, pipeline.dtype)
+
+ # Generate video
+ video_frames = pipeline(
+     prompt=prompt,
+     image=image_tensor,
+     cond_position=cond_position,
+     num_inference_steps=num_steps,
+     noise_multiplier=noise_multiplier,
+     generator=torch.Generator(device=pipeline.device).manual_seed(0),
+ ).frames[0]
+
+ # Save or display the video frames
+ # Example: Save frames as a GIF (requires imageio, Pillow)
+ import imageio
+
+ output_gif_path = "output_video.gif"
+ imageio.mimsave(output_gif_path, [Image.fromarray(f) for f in video_frames], fps=10)
+ print(f"Video saved to {output_gif_path}")
+ ```
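If you would rather export an MP4 than a GIF, `imageio` can also write video files from the same `video_frames` produced above (this assumes the `imageio-ffmpeg` backend is installed, e.g. `pip install imageio-ffmpeg`):

```python
import imageio
import numpy as np

# Write the generated frames to an MP4 file (requires the imageio-ffmpeg backend).
with imageio.get_writer("output_video.mp4", fps=10) as writer:
    for frame in video_frames:
        writer.append_data(np.asarray(frame))
print("Video saved to output_video.mp4")
```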
+
  ## Limitations
 
  Pusa currently has several known limitations:
+ - The base Mochi model generates videos at relatively low resolution (480p)
+ - We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
+ - We welcome community contributions to enhance model performance and extend its capabilities
 
  ## Related Work
 
+ - [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
+ - [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
 
  ## Citation
 
  If you find our work useful in your research, please consider citing:
 
  ```
  @misc{Liu2025pusa,
+   title={Pusa: Thousands Timesteps Video Diffusion Model},
+   author={Yaofang Liu and Rui Liu},
+   year={2025},
+   url={https://github.com/Yaofang-Liu/Pusa-VidGen},
  }
  ```
 
  ```
  @article{liu2024redefining,
+   title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
+   author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
+   journal={arXiv preprint arXiv:2410.03160},
+   year={2024}
  }
+ ```