Improve model card: Add metadata, links, and usage example

#4
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +202 -38
README.md CHANGED
 @@ -1,6 +1,16 @@
  # Pusa VidGen
 
- [Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://arxiv.org/abs/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)
 
  ## Overview
 
@@ -8,34 +18,34 @@ Pusa introduces a paradigm shift in video diffusion modeling through frame-level
 
  ## ✨ Key Features
 
- - **Comprehensive Multi-task Support**:
-   - Text-to-Video generation
-   - Image-to-Video transformation
-   - Frame interpolation
-   - Video transitions
-   - Seamless looping
-   - Extended video generation
-   - And more...
-
- - **Unprecedented Efficiency**:
-   - Trained with only 0.1k H800 GPU hours
-   - Total training cost: $0.1k
-   - Hardware: 16 H800 GPUs
-   - Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate
-   - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*
-
- - **Complete Open-Source Release**:
-   - Full codebase
-   - Detailed architecture specifications
-   - Comprehensive training methodology
 
  ## 🔍 Unique Architecture
 
- - **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability.
 
- - **Non-destructive Modification**: Our adaptations to the base model preserve its original Text-to-Video generation capabilities. After this adaptation, we only need a slight fine-tuning.
 
- - **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
 
  ## Installation and Usage
 
@@ -49,17 +59,171 @@ huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_di
 
  **Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
 
  ## Limitations
 
  Pusa currently has several known limitations:
- - The base Mochi model generates videos at relatively low resolution (480p)
- - We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- - We welcome community contributions to enhance model performance and extend its capabilities
 
  ## Related Work
 
- - [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
- - [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
 
  ## Citation
 
@@ -67,18 +231,18 @@ If you find our work useful in your research, please consider citing:
 
  ```
  @misc{Liu2025pusa,
-   title={Pusa: Thousands Timesteps Video Diffusion Model},
-   author={Yaofang Liu and Rui Liu},
-   year={2025},
-   url={https://github.com/Yaofang-Liu/Pusa-VidGen},
  }
  ```
 
  ```
  @article{liu2024redefining,
-   title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
-   author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
-   journal={arXiv preprint arXiv:2410.03160},
-   year={2024}
  }
- ```
 
+ ---
+ pipeline_tag: image-to-video
+ library_name: diffusers
+ license: apache-2.0
+ tags:
+ - text-to-video
+ - diffusion-models
+ - video-generation
+ ---
+
  # Pusa VidGen
 
+ [Code Repository](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Project Page](https://yaofang-liu.github.io/Pusa_Web/) | [Model Hub](https://huggingface.co/RaphaelLiu/Pusa-V0.5) | [Training Toolkit](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner) | [Dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training) | [Pusa Paper](https://huggingface.co/papers/2507.16116) | [FVDM Paper](https://arxiv.org/abs/2410.03160) | [Follow on X](https://x.com/stephenajason) | [Xiaohongshu](https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748)
 
  ## Overview
 
  ## ✨ Key Features
 
+ - **Comprehensive Multi-task Support**:
+   - Text-to-Video generation
+   - Image-to-Video transformation
+   - Frame interpolation
+   - Video transitions
+   - Seamless looping
+   - Extended video generation
+   - And more...
+
+ - **Unprecedented Efficiency**:
+   - Trained with only 0.1k H800 GPU hours
+   - Total training cost: $0.1k
+   - Hardware: 16 H800 GPUs
+   - Configuration: Batch size 32, 500 training iterations, 1e-5 learning rate (see the sketch after this list)
+   - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*
+
+ - **Complete Open-Source Release**:
+   - Full codebase
+   - Detailed architecture specifications
+   - Comprehensive training methodology
 
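For readers who want the reported training setup at a glance, here is a small, hypothetical sketch that simply transcribes the numbers listed above into a Python dict; the Mochi-Full-Finetuner toolkit defines its own configuration format, so treat the key names as illustrative only.

```python
# Hypothetical transcription of the training setup reported above.
# Key names are illustrative; this is not the training toolkit's real schema.
pusa_v05_training_setup = {
    "base_model": "genmo/mochi-1-preview",  # foundation model listed under Related Work
    "hardware": "16x H800",
    "gpu_hours": 100,              # "0.1k H800 GPU hours"
    "global_batch_size": 32,
    "training_iterations": 500,
    "learning_rate": 1e-5,
}
```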
  ## 🔍 Unique Architecture
 
+ - **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (see the illustrative sketch after this list).
 
+ - **Non-destructive Modification**: Our adaptations to the base model preserve its original Text-to-Video generation capabilities; after this adaptation, only light fine-tuning is needed.
 
+ - **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
 
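To make the frame-level noise control idea concrete, here is a minimal, self-contained sketch in plain PyTorch (not Pusa's actual API; all shapes and names are hypothetical): instead of one scalar timestep shared by every frame, each frame gets its own entry in a timestep vector, so a conditioning frame can stay clean while the remaining frames are fully noised.

```python
import torch

# Illustrative sketch of a vectorized timestep; shapes and names are hypothetical.
num_frames, channels, height, width = 16, 4, 60, 106
latents = torch.randn(num_frames, channels, height, width)
noise = torch.randn_like(latents)

# Conventional video diffusion: a single scalar timestep for the whole clip.
scalar_timestep = torch.tensor(999)

# Frame-level noise control: one timestep per frame
# (frame 0 kept clean, e.g. as image conditioning).
frame_timesteps = torch.full((num_frames,), 999)
frame_timesteps[0] = 0

# Apply per-frame noise levels under a simple linear (flow-matching-style) schedule.
sigmas = (frame_timesteps.float() / 1000.0).view(-1, 1, 1, 1)  # [F, 1, 1, 1]
noisy_latents = (1.0 - sigmas) * latents + sigmas * noise

print(frame_timesteps[:4])    # tensor([  0, 999, 999, 999])
print(noisy_latents.shape)    # torch.Size([16, 4, 60, 106])
```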
  ## Installation and Usage
 
 
  **Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
 
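If you prefer a programmatic download instead of the CLI or browser options above, the `huggingface_hub` library's `snapshot_download` can fetch the repository as well (shown here as an optional alternative; the target directory is just an example):

```python
from huggingface_hub import snapshot_download

# Download the full Pusa-V0.5 repository to a local directory of your choice.
local_dir = snapshot_download(repo_id="RaphaelLiu/Pusa-V0.5", local_dir="./Pusa-V0.5")
print(f"Model files downloaded to: {local_dir}")
```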
+ ### Sample Usage: Image-to-Video Generation
+
+ To generate a video from an image input, you can use the Python example below, which performs image-to-video generation with Pusa.
+
+ First, ensure you have the necessary libraries installed:
+ ```bash
+ pip install torch transformers diffusers pillow imageio numpy torchvision
+ pip install uv # For installing genmo models
+ # Then, navigate to a directory where you want to clone the genmo models
+ git clone https://github.com/genmoai/models
+ cd models
+ uv venv .venv
+ source .venv/bin/activate
+ uv pip install setuptools
+ uv pip install -e . --no-build-isolation
+ # If you want flash attention:
+ # uv pip install -e .[flash] --no-build-isolation
+ ```
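Before running the generation script, an optional sanity check confirms that PyTorch can see a CUDA device, since the pipeline below is moved to `"cuda"`:

```python
import torch

# Optional: verify the environment before running the full generation script.
print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```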
+
+ Now, you can run the following Python script. Remember to replace the placeholder model path with the actual local path where you downloaded the model weights.
+
+ ```python
+ import os
+ import numpy as np
+ import torch
+ import torchvision.transforms as T
+ from PIL import Image
+ from torchvision.transforms.functional import InterpolationMode
+ from transformers import AutoModel, AutoTokenizer
+ from diffusers import FlowMatchEulerDiscreteScheduler
+ from safetensors.torch import load_file
+ from genmo.mochi_preview.pipelines import MochiPipeline
+
+ IMAGENET_MEAN = (0.485, 0.456, 0.406)
+ IMAGENET_STD = (0.229, 0.224, 0.225)
+
+ # Optional tiling/preprocessing helpers (not used by the minimal example below).
+ def build_transform(input_size):
+     MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+     transform = T.Compose([
+         T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+         T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+         T.ToTensor(),
+         T.Normalize(mean=MEAN, std=STD)
+     ])
+     return transform
+
+ def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+     best_ratio_diff = float('inf')
+     best_ratio = (1, 1)
+     area = width * height
+     for ratio in target_ratios:
+         target_aspect_ratio = ratio[0] / ratio[1]
+         ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+         if ratio_diff < best_ratio_diff:
+             best_ratio_diff = ratio_diff
+             best_ratio = ratio
+         elif ratio_diff == best_ratio_diff:
+             if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                 best_ratio = ratio
+     return best_ratio
+
+ def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
+     orig_width, orig_height = image.size
+     aspect_ratio = orig_width / orig_height
+
+     target_ratios = set(
+         (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+         i * j <= max_num and i * j >= min_num)
+     target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+     target_aspect_ratio = find_closest_aspect_ratio(
+         aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+     target_width = image_size * target_aspect_ratio[0]
+     target_height = image_size * target_aspect_ratio[1]
+     blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+     resized_img = image.resize((target_width, target_height))
+     processed_images = []
+     for i in range(blocks):
+         box = (
+             (i % (target_width // image_size)) * image_size,
+             (i // (target_width // image_size)) * image_size,
+             ((i % (target_width // image_size)) + 1) * image_size,
+             ((i // (target_width // image_size)) + 1) * image_size
+         )
+         split_img = resized_img.crop(box)
+         processed_images.append(split_img)
+     assert len(processed_images) == blocks
+     if use_thumbnail and len(processed_images) != 1:
+         thumbnail_img = image.resize((image_size, image_size))
+         processed_images.append(thumbnail_img)
+     return processed_images
+
+ def load_image(image_file, input_size=448, max_num=12):
+     image = Image.open(image_file).convert('RGB')
+     transform = build_transform(input_size=input_size)
+     images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+     pixel_values = [transform(image) for image in images]
+     pixel_values = torch.stack(pixel_values)
+     return pixel_values
+
+ # Load the pipeline
+ # The path below assumes Pusa-V0.5 is downloaded to a directory named Pusa-V0.5
+ # relative to where you run the script, or provide the full path.
+ pipeline = MochiPipeline.from_pretrained(
+     "RaphaelLiu/Pusa-V0.5",  # or "/path/to/Pusa-V0.5"
+     torch_dtype=torch.float16,
+ )
+ pipeline.to("cuda")  # Ensure pipeline is moved to GPU
+
+ # Load the additional DIT weights for Pusa
+ # Make sure the path to pusa_v0_dit.safetensors is correct;
+ # .safetensors files are loaded with safetensors, not torch.load.
+ dit_weights_path = "./Pusa-V0.5/pusa_v0_dit.safetensors"  # Adjust if your download path is different
+ pipeline.transformer.load_state_dict(load_file(dit_weights_path), strict=False)
+
+ # Example parameters for generation
+ prompt = "The camera remains still, the man is surfing on a wave with his surfboard."
+ # Create a dummy image for demonstration if actual image is not present
+ # In a real scenario, replace with a path to a .jpg image file
+ try:
+     image = Image.open("./demos/example.jpg").convert('RGB')  # Assumes running from Pusa-VidGen root
+ except FileNotFoundError:
+     print("Example image not found. Creating a dummy image for demonstration.")
+     image = Image.new('RGB', (512, 512), color='red')
+     # Save the dummy image for use by the script
+     os.makedirs("./demos", exist_ok=True)
+     image.save("./demos/example.jpg")
+ image_path = "./demos/example.jpg"
+
+ cond_position = 0
+ num_steps = 30
+ noise_multiplier = 0.4
+
+ # Load and preprocess the image (using feature_extractor from the pipeline)
+ image_tensor = pipeline.feature_extractor.preprocess(image, return_tensors="pt").pixel_values
+ image_tensor = image_tensor.to(pipeline.device, pipeline.dtype)
+
+ # Generate video
+ video_frames = pipeline(
+     prompt=prompt,
+     image=image_tensor,
+     cond_position=cond_position,
+     num_inference_steps=num_steps,
+     noise_multiplier=noise_multiplier,
+     generator=torch.Generator(device=pipeline.device).manual_seed(0),
+ ).frames[0]
+
+ # Save or display the video frames
+ # Example: Save frames as a GIF (requires imageio, Pillow)
+ import imageio
+
+ output_gif_path = "output_video.gif"
+ imageio.mimsave(output_gif_path, [Image.fromarray(f) for f in video_frames], fps=10)
+ print(f"Video saved to {output_gif_path}")
+ ```
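If you would rather export an MP4 than a GIF, `imageio` can also write video files from the same `video_frames` produced above (this assumes the `imageio-ffmpeg` backend is installed, e.g. `pip install imageio-ffmpeg`):

```python
import imageio
import numpy as np

# Write the generated frames to an MP4 file (requires the imageio-ffmpeg backend).
with imageio.get_writer("output_video.mp4", fps=10) as writer:
    for frame in video_frames:
        writer.append_data(np.asarray(frame))
print("Video saved to output_video.mp4")
```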
+
  ## Limitations
 
  Pusa currently has several known limitations:
+ - The base Mochi model generates videos at relatively low resolution (480p)
+ - We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
+ - We welcome community contributions to enhance model performance and extend its capabilities
 
  ## Related Work
 
+ - [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
+ - [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
 
  ## Citation
 
  If you find our work useful in your research, please consider citing:
 
  ```
  @misc{Liu2025pusa,
+   title={Pusa: Thousands Timesteps Video Diffusion Model},
+   author={Yaofang Liu and Rui Liu},
+   year={2025},
+   url={https://github.com/Yaofang-Liu/Pusa-VidGen},
  }
  ```
 
  ```
  @article{liu2024redefining,
+   title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
+   author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
+   journal={arXiv preprint arXiv:2410.03160},
+   year={2024}
  }
+ ```