Hardware requirements
Hello. Thanks for the great model!
I'd like to know the minimum system requirements to run it. In particular, how much VRAM do I need?
The 384p version requires around 26 GB of memory, and the 768p version requires around 40 GB (we don't have exact numbers because of the caching mechanism on our 80 GB GPUs).
Thanks for the great work.
Suggestion: it would be great if the hardware requirements were mentioned in the README.
Also, I'd like to pick your brain: do you think a model like this will eventually be able to run on a smaller consumer-grade GPU?
Both work with 24 GB VRAM in bf16 precision.
Could you tell us approximately how long it takes to generate one video? Is it more like 1-2 minutes or 10 minutes, for example on an RTX 3090 or 4090?
A 5 second video takes a little over a minute on a 3090. Longer videos can take considerably longer to generate.
I think the VRAM requirements are actually even lower than I expected, I was able to get it working at just a little over 12 GB VRAM max usage with better memory management, and the text encoder was the part that used the most (and we already have ways to reduce that, like using fp8, nf4, or GGUF). I think this wouldn't have any issues running on 16 GB VRAM.
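For the nf4 route, something along these lines should work through transformers + bitsandbytes (a sketch; the subfolder name is a guess about the checkpoint layout, so check the downloaded model dir):

```python
# Hedged sketch: load the heavy text encoder in 4-bit NF4 via bitsandbytes.
# The subfolder name below is an assumption about where the T5 encoder lives.
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
text_encoder = T5EncoderModel.from_pretrained(
    "../pyramid-flow-sd3",       # the downloaded checkpoint dir
    subfolder="text_encoder_2",  # assumption; adjust to the actual layout
    quantization_config=bnb_config,
)
```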
How exactly did you manage to run this with lower VRAM? Any tips?
> A 5 second video takes a little over a minute on a 3090. Longer videos can take considerably longer to generate.
>
> I think the VRAM requirements are actually even lower than I expected, I was able to get it working at just a little over 12 GB VRAM max usage with better memory management, and the text encoder was the part that used the most (and we already have ways to reduce that, like using fp8, nf4, or GGUF). I think this wouldn't have any issues running on 16 GB VRAM.
Could you please tell us how to convert it to fp8 or bf16 to run this model on an RTX 4090?
You can use `cpu_offloading=True` to only have the models loaded onto the GPU when they're needed. This lowers the memory requirement to about 12 GB.
You can load the models in bf16 from the fp32 weights by just setting `model_dtype` to bf16, as in the demo notebook.
If you want to download the bf16 versions of the models directly instead of the fp32 ones, I've uploaded them here:
https://huggingface.co/SeanScripts/pyramid-flow-sd3-bf16/tree/main
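For example, to fetch that repo with huggingface_hub (the local dir is up to you):

```python
# Download the pre-converted bf16 weights linked above.
from huggingface_hub import snapshot_download

model_path = snapshot_download(
    "SeanScripts/pyramid-flow-sd3-bf16",
    local_dir="../pyramid-flow-sd3-bf16",  # pass this path to the model loader
)
```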
I actually haven't figured out how to convert to fp8, like at all... If anyone knows, please let me know. As far as I can tell, I can make `torch.float8_e4m3fn` tensors, but I can't do any operations on them without getting an error saying they're not implemented in CUDA, even though ComfyUI somehow uses them all the time. (Speaking of which, I may have a set of ComfyUI custom nodes for this model soon...)
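The closest thing I've gotten to work is treating fp8 purely as a storage format (a sketch; I'm assuming this mirrors what ComfyUI does, but I haven't verified that against its source):

```python
# Hedged sketch: keep weights in float8_e4m3fn to halve their memory, and
# upcast to bf16 right before each op, since most kernels aren't
# implemented for float8 tensors.
import torch

linear = torch.nn.Linear(4096, 4096, bias=False)
w_fp8 = linear.weight.data.to(torch.float8_e4m3fn)  # storage-only cast

x = torch.randn(2, 4096, dtype=torch.bfloat16)
y = x @ w_fp8.to(torch.bfloat16).T  # upcast just for the matmul
```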
> A 5 second video takes a little over a minute on a 3090. Longer videos can take considerably longer to generate.
>
> I think the VRAM requirements are actually even lower than I expected, I was able to get it working at just a little over 12 GB VRAM max usage with better memory management, and the text encoder was the part that used the most (and we already have ways to reduce that, like using fp8, nf4, or GGUF). I think this wouldn't have any issues running on 16 GB VRAM.
Can you share your script? Somehow I'm unable to generate a video in less than 7 minutes on an RTX 3090 (and GPU memory is fully utilized).
I'm pretty much just using the code from the demo notebook. This takes under 2 minutes total on a 3090 for me.
Make sure you have the latest code from the inference repo.
Though I did just notice it's missing the empty-cache steps during CPU offloading; that makes a big difference. Without them, VRAM usage ends up at 16 GB for this example. I'll go ahead and make a PR to add them...
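The missing piece is just this pattern (a generic sketch, not the repo's exact code):

```python
# Sketch of the offload-plus-empty-cache pattern: module.to("cpu") drops
# the tensors, but the CUDA caching allocator keeps the blocks reserved
# until empty_cache() releases them, so peak VRAM stays high without it.
import torch

def offload_to_cpu(module: torch.nn.Module) -> None:
    module.to("cpu")
    torch.cuda.empty_cache()
```

For reference, the full script I'm running: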
```python
import os
import json
import torch
import numpy as np
import PIL
from PIL import Image
from IPython.display import HTML
from pyramid_dit import PyramidDiTForVideoGeneration
from IPython.display import Image as ipython_image
from diffusers.utils import load_image, export_to_video, export_to_gif

variant = 'diffusion_transformer_384p'  # For low resolution
model_path = "../pyramid-flow-sd3"      # The downloaded checkpoint dir
model_dtype = "bf16"

device_id = 0
torch.cuda.set_device(device_id)

model = PyramidDiTForVideoGeneration(
    model_path,
    model_dtype,
    model_variant=variant,
)

if model_dtype == "bf16":
    torch_dtype = torch.bfloat16
elif model_dtype == "fp16":
    torch_dtype = torch.float16
else:
    torch_dtype = torch.float32

def show_video(ori_path, rec_path, width="100%"):
    html = ''
    if ori_path is not None:
        html += f"""<video controls="" name="media" data-fullscreen-container="true" width="{width}">
        <source src="{ori_path}" type="video/mp4">
        </video>
        """
    html += f"""<video controls="" name="media" data-fullscreen-container="true" width="{width}">
    <source src="{rec_path}" type="video/mp4">
    </video>
    """
    return HTML(html)

prompt = "[prompt]"
output_name = "./text_to_video_sample.mp4"

# Used for the 384p model variant
width = 640
height = 384
temp = 16  # temp in [1, 31] <=> frames in [1, 241] <=> duration in [0, 10s]

torch.cuda.empty_cache()
model.vae.enable_tiling()

with torch.no_grad(), torch.cuda.amp.autocast(enabled=(model_dtype != 'fp32'), dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=height,
        width=width,
        temp=temp,
        guidance_scale=9.0,        # The guidance for the first frame
        video_guidance_scale=5.0,  # The guidance for the other video latents
        output_type="pil",
        save_memory=True,     # If you have enough GPU memory, set to False to speed up VAE decoding
        cpu_offloading=True,  # Unload models after using them
    )

export_to_video(frames, output_name, fps=24)
show_video(None, output_name, "70%")
```
Ahh, but that's 384p :/
Is there a way to split it across two GPUs?
You can still run the 768p one on 24 GB VRAM as well, it will just take longer.
Looks like Kijai beat me to the ComfyUI nodes. :)
Just tested the 768p model on a 3090 with CPU offloading. A 5-second video took about 8 minutes total. Max VRAM usage was around 18 GB.
> Is there a way to split it across two GPUs?
Yes, we are working on supporting inference across multiple GPUs. Stay tuned!
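In the meantime, a crude manual split is possible in principle (a sketch; attribute names other than `vae` are guesses about PyramidDiTForVideoGeneration's internals, and it won't combine with `cpu_offloading=True`):

```python
# Hedged sketch of a manual two-GPU split: pin different submodels to
# different devices. Only `vae` is confirmed by the demo code; the other
# attribute names are assumptions.
model.text_encoder.to("cuda:1")  # prompt encoding on the second GPU
model.vae.to("cuda:1")           # VAE decode on the second GPU
model.dit.to("cuda:0")           # diffusion transformer on the first GPU
```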
It's an amazing project! Congratulations, and thanks for making it open source.
A dumb question: would it be possible to combine text-to-video and image-to-video generation to produce longer videos, with more than 10 s?
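Something like this chaining is what I have in mind (a rough sketch; `generate_i2v` and its arguments are assumptions based on the repo's image-to-video demo, and quality may drift across segments):

```python
# Hedged sketch: extend a clip by conditioning a second generation on the
# last frame of the first. `generate_i2v` and `input_image` are assumptions
# based on the repo's image-to-video demo; check the inference code.
def generate_long_video(model, prompt, temp=16):
    frames_a = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=384, width=640, temp=temp,
        guidance_scale=9.0, video_guidance_scale=5.0,
        output_type="pil", save_memory=True, cpu_offloading=True,
    )
    frames_b = model.generate_i2v(
        prompt=prompt,
        input_image=frames_a[-1],  # already 640x384, matching the first clip
        num_inference_steps=[10, 10, 10],
        temp=temp,
        video_guidance_scale=4.0,
        output_type="pil", save_memory=True, cpu_offloading=True,
    )
    return frames_a + frames_b[1:]  # drop the duplicated seam frame
```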
I'm trying this with a Radeon 7900 XTX (24 GB VRAM). 384p videos render fine, but no dice with 768p using the full-precision models and CPU offloading (80 GB system RAM):
File "/usr/lib/python3.10/pathlib.py", line 1290, in exists
self.stat()
File "/usr/lib/python3.10/pathlib.py", line 1097, in stat
return self._accessor.stat(self, follow_symlinks=follow_symlinks)
OSError: [Errno 36] File name too long: 'Error during video generation: HIP out of memory. Tried to allocate 8.46 GiB. G
PU 0 has a total capacity of 23.98 GiB of which 8.01 GiB is free. Of the allocated memory 14.52 GiB is allocated by PyTo
rch, and 1.08 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORC
H_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https:/pyto
rch.org/docs/stable/notes/cuda.html#environment-variables)'
Additionally, it's trying to save the video using the error message as the file name?
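That usually means an error string got returned where a file path was expected, along these lines (a guess at the pattern; `run_pipeline` here is a hypothetical stand-in, not the actual app code):

```python
# Hypothetical sketch of the failure mode: the except branch returns the
# error message, and the caller then treats that string as the output
# path, which is what triggers "File name too long".
def run_pipeline(prompt: str) -> str:
    raise RuntimeError("HIP out of memory")  # stand-in for the real pipeline

def generate(prompt: str) -> str:
    try:
        return run_pipeline(prompt)  # hypothetical; returns a video file path
    except Exception as e:
        return f"Error during video generation: {e}"  # caller saves to this "path"
```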
Disclaimer: I'm using Python 3.10 and PyTorch 2.4.1 (for ROCm 6.1).
It won't work with bf16 either; it still tries to allocate too much VRAM.
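The allocator hint from the error message itself might be worth trying; it has to be set before PyTorch initializes the allocator:

```python
# From the error message's own suggestion: enable expandable segments on
# ROCm to reduce fragmentation. Set the env var before importing torch
# (or export it in the shell before launching the script).
import os
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"
import torch  # noqa: E402  (imported after the env var on purpose)
```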
I always get a CUDA out-of-memory error when trying to use the 768p model; only 384p works for me. I ran it on an A100 GPU. Does anyone have a similar issue? I hope the developer team can fix it.
I tried the following code, but the output is a black video.
```python
import os
import json
import torch
import numpy as np
import PIL
from PIL import Image
from IPython.display import HTML
from pyramid_dit import PyramidDiTForVideoGeneration
from IPython.display import Image as ipython_image
from diffusers.utils import load_image, export_to_video, export_to_gif

variant = 'diffusion_transformer_384p'  # For low resolution
model_path = "pyramid_flow_model"       # The downloaded checkpoint dir
model_dtype = "fp16"

device_id = 0
torch.cuda.set_device(device_id)

model = PyramidDiTForVideoGeneration(
    model_path,
    model_dtype,
    model_variant=variant,
)

if model_dtype == "bf16":
    torch_dtype = torch.bfloat16
elif model_dtype == "fp16":
    torch_dtype = torch.float16
else:
    torch_dtype = torch.float32

prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"
output_name = "./text_to_video_sample.mp4"

# Used for the 384p model variant
width = 640
height = 384
temp = 16  # temp in [1, 31] <=> frames in [1, 241] <=> duration in [0, 10s]

torch.cuda.empty_cache()
model.vae.enable_tiling()

with torch.no_grad(), torch.cuda.amp.autocast(enabled=(model_dtype != 'fp32'), dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=height,
        width=width,
        temp=temp,
        guidance_scale=9.0,        # The guidance for the first frame
        video_guidance_scale=5.0,  # The guidance for the other video latents
        output_type="pil",
        save_memory=True,     # If you have enough GPU memory, set to False to speed up VAE decoding
        cpu_offloading=True,  # Unload models after using them
    )

export_to_video(frames, output_name, fps=24)
```
I am on a Tesla T4 (15.3 GB of VRAM).
> I tried the following code, but the output is a black video.
> ...I am on a Tesla T4 (15.3 GB of VRAM).
Hi, same here. I ran it on an RTX 2080 Ti with 11 GB VRAM.
I found that if I run in fp16 I get a black video, but if I run with fp32 I get a good video.
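That's consistent with fp16 overflowing to NaN somewhere in the pipeline: fp16 has a much narrower exponent range than fp32, and Turing cards like the T4 and 2080 Ti have no native bf16 support, so fp32 is the safe fallback. A quick check on the decoded frames (a sketch):

```python
# Hedged sketch: NaN latents typically decode to all-black frames, so an
# all-zero pixel range on the decoded PIL frames is a strong hint of
# fp16 overflow rather than a file-export problem.
import numpy as np

def looks_black(frames) -> bool:
    arr = np.stack([np.asarray(f) for f in frames])
    return int(arr.max()) == 0
```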