Improve model card: Add metadata, links, and usage example
This PR improves the model card by:
- Adding `pipeline_tag`, `library_name`, `license`, and `tags` to the metadata section for better discoverability and integration with the Hugging Face Hub.
- Updating the Pusa paper link from arXiv to the Hugging Face Papers page.
- Adding a prominent link to the project page (`https://yaofang-liu.github.io/Pusa_Web/`).
- Including a detailed sample usage example for Image-to-Video generation, enhancing usability.
README.md (changed):

---
pipeline_tag: image-to-video
library_name: diffusers
license: apache-2.0
tags:
- text-to-video
- diffusion-models
- video-generation
---

# Pusa VidGen

[Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Project Page](https://yaofang-liu.github.io/Pusa_Web/) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://huggingface.co/papers/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)
## Overview

Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, enabling unprecedented flexibility and scalability.
## ✨ Key Features

- **Comprehensive Multi-task Support**:
  - Text-to-Video generation
  - Image-to-Video transformation
  - Frame interpolation
  - Video transitions
  - Seamless looping
  - Extended video generation
  - And more...

- **Unprecedented Efficiency**:
  - Trained with only 0.1k H800 GPU hours
  - Total training cost: $0.1k
  - Hardware: 16 H800 GPUs
  - Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate
  - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*

- **Complete Open-Source Release**:
  - Full codebase
  - Detailed architecture specifications
  - Comprehensive training methodology
## 🔍 Unique Architecture

- **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (see the minimal sketch below).

- **Non-destructive Modification**: Our adaptations preserve the base model's original Text-to-Video generation capabilities; after the adaptation, only a light fine-tuning is needed.

- **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models, including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
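
Below is a minimal, purely illustrative sketch of the vectorized-timestep idea (not the actual Pusa implementation): instead of one shared scalar timestep, each frame carries its own noise level, so selected frames (for example a conditioning image) can be kept clean while the rest are noised.

```python
import torch

# Toy "video": 16 frames of 3x64x64 (illustration only; shapes are arbitrary).
frames = torch.randn(16, 3, 64, 64)

# Vectorized timesteps: one noise level per frame instead of a single scalar.
t = torch.rand(frames.shape[0])
t[0] = 0.0  # e.g., keep frame 0 clean as the image condition

# Flow-matching-style interpolation between data and noise, applied per frame.
noise = torch.randn_like(frames)
sigma = t.view(-1, 1, 1, 1)  # broadcast each frame's own noise level
noisy_frames = (1.0 - sigma) * frames + sigma * noise

print(noisy_frames.shape, t[:4])
```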
## Installation and Usage

**Option 1**: Download the model with the Hugging Face CLI: `huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>`

**Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
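
If you prefer to script the download (a generic `huggingface_hub` pattern, not taken from the original card), `snapshot_download` fetches the full repository; the `local_dir` below is just an example location:

```python
from huggingface_hub import snapshot_download

# Download the full Pusa-V0.5 repository to a local directory.
local_path = snapshot_download(repo_id="RaphaelLiu/Pusa-V0.5", local_dir="./Pusa-V0.5")
print(f"Model files downloaded to: {local_path}")
```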
### Sample Usage: Image-to-Video Generation

The following example shows how to generate a video from a single input image with the Pusa model.

First, ensure you have the necessary libraries installed and set up the [genmo models](https://github.com/genmoai/models) environment:
```bash
pip install torch transformers diffusers pillow imageio numpy torchvision
pip install uv  # For installing genmo models

# Then, navigate to a directory where you want to clone the genmo models
git clone https://github.com/genmoai/models
cd models
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
# If you want flash attention:
# uv pip install -e .[flash] --no-build-isolation
```
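
As an optional sanity check (not a step from the original instructions), you can confirm that the editable install of the genmo package succeeded; the module path is the same one the sample script imports from:

```python
# Verifies that the genmo package from the cloned repo is importable in the active environment.
import genmo.mochi_preview.pipelines as mochi_pipelines

print("genmo import OK:", mochi_pipelines.__name__)
```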
Now you can run the following Python script. If you downloaded the weights locally, replace `"RaphaelLiu/Pusa-V0.5"` in the script with the path to your local copy.
```python
import os

import imageio
import numpy as np
import torch
import torchvision.transforms as T
from huggingface_hub import hf_hub_download
from PIL import Image
from safetensors.torch import load_file
from torchvision.transforms.functional import InterpolationMode
from genmo.mochi_preview.pipelines import MochiPipeline

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


# --- Optional image tiling helpers (kept from the original example; not used in the main flow below) ---

def build_transform(input_size):
    """ImageNet-normalized resize-to-square transform."""
    return T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    """Pick the tiling grid whose aspect ratio best matches the input image."""
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    """Split an image into image_size x image_size tiles that match its aspect ratio."""
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    target_ratios = sorted(
        {(i, j) for n in range(min_num, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda x: x[0] * x[1])

    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
        processed_images.append(resized_img.crop(box))
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        processed_images.append(image.resize((image_size, image_size)))
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    """Load an image file as a stacked tensor of normalized tiles."""
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    return torch.stack([transform(img) for img in images])


# --- Load the pipeline ---
# Loads directly from the Hub; pass the full local path instead (e.g. "/path/to/Pusa-V0.5")
# if you downloaded the weights beforehand.
pipeline = MochiPipeline.from_pretrained(
    "RaphaelLiu/Pusa-V0.5",  # or "/path/to/Pusa-V0.5"
    torch_dtype=torch.float16,
)
pipeline.to("cuda")  # Ensure the pipeline is moved to GPU

# Load the additional DiT weights for Pusa.
# The file name comes from the model repository; point directly to a local copy if you have one.
dit_weights_path = hf_hub_download("RaphaelLiu/Pusa-V0.5", "pusa_v0_dit.safetensors")
pipeline.transformer.load_state_dict(load_file(dit_weights_path), strict=False)

# Example parameters for generation
prompt = "The camera remains still, the man is surfing on a wave with his surfboard."

# Use the demo image if present; otherwise create a dummy image for demonstration.
# In a real scenario, replace this with the path to your own .jpg image.
image_path = "./demos/example.jpg"  # Assumes running from the Pusa-VidGen repository root
try:
    image = Image.open(image_path).convert('RGB')
except FileNotFoundError:
    print("Example image not found. Creating a dummy image for demonstration.")
    os.makedirs("./demos", exist_ok=True)
    image = Image.new('RGB', (512, 512), color='red')
    image.save(image_path)  # Save the dummy image for reuse

cond_position = 0       # position of the conditioning frame (value from the original example)
num_steps = 30          # number of denoising steps
noise_multiplier = 0.4  # noise multiplier for conditioning (value from the original example)

# Load and preprocess the image (using the pipeline's feature extractor)
image_tensor = pipeline.feature_extractor.preprocess(image, return_tensors="pt").pixel_values
image_tensor = image_tensor.to(pipeline.device, pipeline.dtype)

# Generate video
video_frames = pipeline(
    prompt=prompt,
    image=image_tensor,
    cond_position=cond_position,
    num_inference_steps=num_steps,
    noise_multiplier=noise_multiplier,
    generator=torch.Generator(device=pipeline.device).manual_seed(0),
).frames[0]

# Save the frames as a GIF (np.asarray handles both PIL images and arrays)
output_gif_path = "output_video.gif"
imageio.mimsave(output_gif_path, [np.asarray(frame) for frame in video_frames], fps=10)
print(f"Video saved to {output_gif_path}")
```
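
If you would rather save an MP4 than a GIF, the `export_to_video` helper from `diffusers` accepts the same list of frames (the output filename and fps below are just example values):

```python
from diffusers.utils import export_to_video

# Write the generated frames to an MP4 file instead of a GIF.
export_to_video(video_frames, "output_video.mp4", fps=10)
```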
## Limitations

Pusa currently has several known limitations:
- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities
## Related Work

- [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
- [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
## Citation

If you find our work useful in your research, please consider citing:

```
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
```

```
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```