Any optimizations to accelerate inference speed?

#7
by mayukitan - opened

Inference time has jumped from 90s-180s on a single device with CogVideoX-5B to 550s-1000s now. Just wondering if there is any way to reduce the inference time, thanks!

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

This large increase in time is due to the substantial increase in video frame rate and resolution: with the same computing power, inference takes much longer because the amount of computation has grown several times over.

On an H800, the estimated inference time shows it will take 4 hours. Is this normal?


Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Have you switched to pipe.to("cuda") as mentioned in the cli_demo.py in the GitHub repository?
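For reference, a minimal sketch of that setup, assuming the diffusers CogVideoXPipeline and the THUDM/CogVideoX1.5-5B checkpoint (swap in the model id, dtype, and prompt you are actually using):

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Assumed checkpoint; use the variant you are actually running.
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")  # keep the whole pipeline on the GPU instead of CPU offloading

video = pipe(
    prompt="a panda playing guitar in a bamboo forest",  # example prompt
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```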

> Have you switched to pipe.to("cuda") as mentioned in the cli_demo.py in the GitHub repository?

I used it, but it still takes a long time:
[screenshots of the inference timings]

Here is my code.

[screenshot of the code]

@Ravencwn comment out pipe.enable_sequential_cpu_offload():

CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process.

...but inference is much slower...
https://huggingface.co/docs/diffusers/main/en/optimization/memory#cpu-offloading
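Put concretely, the three memory/speed options look roughly like this (a sketch using the standard diffusers pipeline methods; enable exactly one of them):

```python
# Assumes `pipe` is the CogVideoX pipeline loaded as in the snippet above.
# Enable exactly one of the three options; they trade VRAM for speed.

# Slowest, lowest VRAM: streams individual submodules to the GPU one at a time.
# pipe.enable_sequential_cpu_offload()

# Middle ground: moves whole sub-models (text encoder, transformer, VAE)
# onto the GPU only while each one is needed.
# pipe.enable_model_cpu_offload()

# Fastest, needs the most VRAM: keep everything resident on the GPU.
pipe.to("cuda")
```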

@tsqn Thanks! Could you tell me how long inference takes for you with CogVideoX1.5?

@Ravencwn Hard to tell; it depends on how many steps, how many frames per second, and what output size you generate. For example, to be able to generate anything on a free A100 (ZeroGPU Space) I have to lower the steps and the number of generated frames so that the generation takes less than 120 seconds. You can try this (but it is not the 1.5 version): CogVideoX-5B-24frames_20steps-low_vram
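Those knobs are just arguments to the pipeline call; a rough sketch, again assuming the diffusers `pipe` from the snippets above (the exact frame counts and resolutions a checkpoint accepts are in its model card):

```python
# Assumes `pipe` is loaded as in the earlier snippet; the prompt is an example.
video = pipe(
    prompt="a panda playing guitar in a bamboo forest",
    num_inference_steps=20,  # fewer denoising steps, roughly linear speed-up
    num_frames=24,           # shorter clip, less work per step
).frames[0]
```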

I've also tried this CogVideoX1.5-5B-I2V workflow for ComfyUI locally. On an RTX 3060 12GB, 53 frames, 29 steps, and 704x448 resolution take about 1 hour with the GGUF version of the model, with VAE tiling enabled but without CPU offload (because I'm using the quantized version). I'm new to this too ^^
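For anyone running the same thing through diffusers instead of ComfyUI, the VAE tiling toggle has a counterpart there; a small sketch, assuming the `pipe` from the snippets above:

```python
# Assumes `pipe` is a CogVideoX pipeline as in the earlier snippets.
# Both helpers reduce peak VRAM during the VAE decode step at a small speed cost.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
```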

@EDIT:
I've installed flash_attention_2 and the inference time decreased from 1 hour to 8 minutes at 352x352 + 50 steps + 49 frames + 16 fps, so I don't know what to say. Pretty crazy.
