VRAM req

#5
by linitini

This is sick. Y'all are the goats.

I see the space is using A100s. How much VRAM is typically required for schnell? Can it fit on an A10 (24 GiB)?

This is the largest model I have seen. I'm not downloading it until I know what the system requirements are.

It depends on how fast you want inference to go. If you're going for max speed, I'd estimate Flux uses about 33 GB of VRAM (4.5B params for the text encoder, 12B params for the DiT; add the params up and multiply by 2 bytes per param for bf16). If you use model CPU offloading, you'll sacrifice a bit of performance, but VRAM usage goes down to about 24 GB (because only the DiT, the largest component, has to be resident at once).
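A quick script version of that back-of-the-envelope math (same numbers as above, weights only, ignoring activations):

# Rough weight-only VRAM estimate for FLUX.1-schnell in bf16 (2 bytes per parameter).
text_encoder_params = 4.5e9   # text encoder, ~4.5B params
dit_params = 12e9             # DiT, ~12B params
bytes_per_param = 2           # bf16

# Everything resident on the GPU at once:
print('Full model (GB):', (text_encoder_params + dit_params) * bytes_per_param / 1e9)  # ~33

# With model CPU offloading, only the largest component (the DiT) sits on the GPU at a time:
print('DiT only (GB):', dit_params * bytes_per_param / 1e9)  # ~24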

I've been able to generate 2048x2048 images (batch size 2) and 1024x1024 (batch size 4) on 6 GB VRAM by using sequential CPU offloading as well as VAE tiling and slicing, although it's really slow. Here's my code:

from diffusers import FluxPipeline
import torch

ckpt_id = "black-forest-labs/FLUX.1-schnell"
prompt = [
    "an astronaut riding a horse",
    # more prompts here
]
height, width = 1024, 1024

# denoising
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    torch_dtype=torch.bfloat16,
)
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
pipe.enable_sequential_cpu_offload() # offloads modules to CPU on a submodule level (rather than model level)

image = pipe(
    prompt,
    num_inference_steps=1,
    guidance_scale=0.0,
    height=height,
    width=width,
).images[0]
print('Max mem allocated (GB) while denoising:', torch.cuda.max_memory_allocated() / (1024 ** 3))

import matplotlib.pyplot as plt
plt.imshow(image)
plt.show()

Note: To maximize GPU utilization, increase the number of prompts and/or the number of images generated per prompt (on my card, I doubled my GPU utilization by increasing the number of images generated before running out of VRAM).
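For example, reusing pipe, height, and width from the script above (num_images_per_prompt is the standard diffusers pipeline argument for this; the second prompt is just a placeholder):

# Batch more work per call: several prompts and/or several images per prompt.
prompts = [
    "an astronaut riding a horse",
    "a watercolor painting of a lighthouse at dawn",  # placeholder second prompt
]
images = pipe(
    prompts,
    num_images_per_prompt=2,   # 2 prompts x 2 images each = 4 images per call
    num_inference_steps=1,
    guidance_scale=0.0,
    height=height,
    width=width,
).images
print('Images generated:', len(images))
print('Max mem allocated (GB):', torch.cuda.max_memory_allocated() / (1024 ** 3))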

Reference documentation:
https://huggingface.co/docs/diffusers/v0.29.2/en/optimization/memory
https://huggingface.co/blog/sd3#memory-optimizations-for-sd3

An 8-bit quantized model can run on a laptop with 64 GB of RAM and 8 GB of VRAM through ComfyUI (4 steps), although it is about as slow as SD3 (30 steps).

fp8 model: https://huggingface.co/Kijai/flux-fp8
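If you'd rather stay in Python than ComfyUI, here's a rough, untested sketch of the same idea using diffusers' bitsandbytes integration (assumes a recent diffusers release that exports BitsAndBytesConfig and that bitsandbytes is installed; it quantizes the DiT at load time to NF4 rather than loading the fp8 checkpoint above, so memory and speed will differ):

from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel
import torch

ckpt_id = "black-forest-labs/FLUX.1-schnell"

# Quantize the 12B DiT to 4-bit NF4 at load time (load_in_8bit=True would be the 8-bit variant).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# Build the pipeline around the quantized DiT; the other components stay bf16.
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe(
    "an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux_nf4.png")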

Also, Flux seems to need about 24 GB of CPU RAM when loading the model checkpoints (with an 8-bit version I'd reckon you would only need around 12 GB of CPU RAM).


Are you sure? I used your code and it just gives me a noisy latent image. Any idea why?

@ABDALLALSWAITI Oh, I think it's because Flux wasn't trained on 2048x2048 images, so the outputs are more likely to turn out bad, especially with num_inference_steps = 1. I modified the script so it should be more likely to work; my bad.

@latentCall145 hold my beer: it works on my Dell Inspiron 15 Gaming with an NVIDIA GeForce 1050 mobile (4 GB VRAM). (The CPU matters here because work is offloaded onto it to make the model fit on the GPU.)
I ran this code: https://github.com/InServiceOfX/InServiceOfX/blob/master/PythonLibraries/HuggingFace/MoreDiffusers/morediffusers/Applications/terminal_only_finite_loop_flux.py
I made this: https://www.instagram.com/p/C-U61P2p0jG/?utm_source=ig_web_copy_link&igsh=MzRlODBiNWFlZA== https://x.com/inserviceofx/status/1820790765670252776
For CPU offloading to work, I had to go down to a resolution of 608 x 880; anything larger and I get a CUDA out-of-memory error.

Try this merged, quantized version of FLUX: https://civitai.com/models/629858


Does anyone have an idea whether using TensorRT to increase the inference speed of flux-schnell would work at all?

@sonam-shrish TensorRT definitely could increase inference speed, but I haven't tried it. I've already been able to get 40% speedups (on my 3060; your mileage may vary) by wrapping the normalization/RoPE layers with torch.compile and replacing the Linear layers with FP16-accumulate linear layers (which are faster than PyTorch Linear layers for consumer graphics cards). My optimizations aren't available through diffusers (I had to patch some source code to do this), but TensorRT does all of this and more (e.g. int8 quantization, CUDA Graphs, etc.), so I wouldn't be surprised if TensorRT gets a 2x speedup for flux schnell.
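Not the patched layer-level setup described above, but as a minimal sketch, the stock torch.compile route compiles the whole DiT (this assumes a GPU that holds the full bf16 model, unlike the offloaded scripts earlier, and the first call pays a long compilation warmup):

from diffusers import FluxPipeline
import torch

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Compile the DiT, which dominates per-step cost; later calls with the same
# image size reuse the compiled graph and run faster.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")

image = pipe(
    "an astronaut riding a horse",
    num_inference_steps=4,
    guidance_scale=0.0,
    height=1024,
    width=1024,
).images[0]
image.save("flux_compiled.png")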

It runs on a Colab T4.


How do I run lllyasviel/flux1-dev-bnb-nf4 with Python code like the code below?

from diffusers import FluxPipeline
import torch

ckpt_id = "black-forest-labs/FLUX.1-schnell"
prompt = "an astronaut riding a horse"
prompt = [prompt] * 2
height, width = 2048, 2048

# denoising
pipe = FluxPipeline.from_pretrained(
    ckpt_id,
    revision="refs/pr/1",
    torch_dtype=torch.bfloat16,
)
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
pipe.enable_sequential_cpu_offload() # offloads modules to CPU on a submodule level (rather than model level)

image = pipe(
    prompt,
    num_inference_steps=1,
    guidance_scale=0.0,
    height=height,
    width=width,
).images[0]
print('Max mem allocated (GB) while denoising:', torch.cuda.max_memory_allocated() / (1024 ** 3))

import matplotlib.pyplot as plt
plt.imshow(image)
plt.show()

lllyasviel/flux1-dev-bnb-nf4
