Man, the RTX 20-series graphics cards don't support BF16. My RTX 2080 Ti with 22GB VRAM takes 10 minutes to generate a single image.
The Qwen-Image model ships in BF16, but since the RTX 20-series graphics cards only support FP16, it ends up running in FP32 precision for image generation. Waiting over ten minutes for a single image to render is really exhausting. Is there any possibility of a technical optimization for this? Could you please take some time to help? I truly need to use this model.
This model is known to have very large activations, larger than fp16 could ever handle, so only bf16 precision is really viable. As a 2080Ti doesn't have native bf16, you're basically stuck with these long generation times.
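If you want to verify the hardware side yourself, here is a minimal PyTorch sketch (assuming a CUDA build of PyTorch) that checks for native BF16 support and shows why FP16 can't hold activations of that size:

```python
import torch

# Turing (RTX 20xx) is compute capability 7.5 and has no native BF16 math;
# Ampere (8.0) and newer do.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print(f"native BF16 supported: {torch.cuda.is_bf16_supported()}")

# FP16 tops out around 65504, while BF16 keeps FP32's exponent range (~3.4e38).
big_activation = torch.tensor(70000.0)
print(big_activation.to(torch.float16))   # inf  -> overflow, output turns to NaN/black
print(big_activation.to(torch.bfloat16))  # ~70144 -> representable, just less precise
```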
Edit: this is basically this answer but rehashed, as you already asked the same question in that thread.
There seems to be a converted FP16 Qwen-Image on Civitai. Idk if quantizing that instead would help?
Taking me around an hour using a 5070 ti so I'm jealous of your 10 minutes.
I have it working on an RTX-2080-super 8GB VRAM, generating in 46.5 seconds.
Using a 4-step Lightning workflow and the smallest GGUF models. It still looks fine, and I'm still experimenting with progressively larger GGUF sizes to find out what else can work.
The workflow I used (I have no affiliation with this guy) was as follows:
Patreon (free download, no need to support him or be a Patreon member): The Local Lab AI, August 18, "Free Qwen-Image 4 Step Text to Image and Image to Image - ComfyUI Workflow & Guide"
The GGUF models he links to OOM the 2080 Super, but it works using the following smaller GGUF files:
Qwen2.5-VL-7B-Instruct-GGUF replaced with Qwen2.5-VL-7B-Instruct-Q2_K.gguf
qwen-image-Q3_K_S.gguf replaced with qwen-image-Q2_K.gguf
Not suggesting it's optimum, but it's a working starting point for 20xx GPU-deficient users like me.
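If you want a rough sanity check before downloading progressively bigger quants, something like this hypothetical helper (not part of any workflow; it just compares file size plus a bit of headroom against free VRAM) can save a few OOM round-trips:

```python
import os
import torch

def gguf_fits_in_vram(gguf_path: str, headroom_gb: float = 1.5) -> bool:
    """Very rough check: does the GGUF file plus some headroom for
    activations fit into the currently free VRAM? (Illustrative only.)"""
    file_gb = os.path.getsize(gguf_path) / 1024**3
    free_bytes, _total = torch.cuda.mem_get_info(0)
    free_gb = free_bytes / 1024**3
    print(f"{os.path.basename(gguf_path)}: {file_gb:.1f} GB, free VRAM: {free_gb:.1f} GB")
    return file_gb + headroom_gb <= free_gb

# Example: pick the largest quant that still fits on an 8 GB card.
for f in ["qwen-image-Q4_K_S.gguf", "qwen-image-Q3_K_S.gguf", "qwen-image-Q2_K.gguf"]:
    if os.path.exists(f) and gguf_fits_in_vram(f):
        print("try:", f)
        break
```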
Try to force the VL LLM part (the text encoder) to run with CPU inference and use a Q4 quant; anything smaller than that has a noticeable drop-off. The lowest you should probably go is IQ4_XS, but idk if that even runs in ComfyUI, so maybe use Q4_0 or Q4_K_S (if you have enough system memory, that is, but since you have a 2000-series card you probably have DDR4 and an upgrade is cheap).
Also, Q2 for qwen-image is rather low; of course if that's the only way to run it, that sucks, but try to increase it to something like Q4. The MultiGPU DisTorch nodes work well for that (you can offload parts of the model to the CPU with only around a 5% speed decrease, but you also need some RAM for that).
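Not ComfyUI node code, but here is a minimal PyTorch sketch of the general idea behind running the text encoder on CPU (the tiny modules below are stand-ins, not real loaders): encode the prompt on the CPU, then hand only the small conditioning tensor to the GPU where the diffusion model lives.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the real models (hypothetical shapes, illustration only).
text_encoder = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64))  # "Qwen2.5-VL" stand-in
diffusion_model = nn.Linear(64, 64)                                      # "qwen-image" stand-in

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Keep the heavy text encoder in system RAM and run it there, VRAM-free.
text_encoder.to("cpu")
tokens = torch.randint(0, 1000, (1, 16))
with torch.no_grad():
    cond = text_encoder(tokens)        # conditioning computed on CPU

# 2) Move only the small conditioning tensor to the GPU and run diffusion there.
cond = cond.to(device)
diffusion_model.to(device)
with torch.no_grad():
    out = diffusion_model(cond)
print(out.shape, out.device)
```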
By using GGUF, you instantly lower your speed.
@wsbagnsv1
recommended a good node, MultiGPU, that I also use 24/7. Use it not only for the model loader but also for the CLIP (safetensors) and VAE loaders, because they fit better in RAM.
I have a 3070 Ti + 64 GB RAM, and I donate 24 GB of RAM as virtual VRAM to obtain 32 GB of VRAM, stonks.
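A rough sketch of the arithmetic behind that "virtual VRAM" budget (psutil and torch only report what's free; the actual offloading is done by the DisTorch loaders, and the numbers here are just illustrative):

```python
import psutil
import torch

# What the card and the system actually have free right now.
free_vram, total_vram = (x / 1024**3 for x in torch.cuda.mem_get_info(0))
free_ram = psutil.virtual_memory().available / 1024**3

# "Donate" part of system RAM as virtual VRAM, keeping a chunk for the OS and ComfyUI itself.
donated = min(24.0, max(0.0, free_ram - 16.0))   # e.g. 24 GB out of 64 GB total
budget = total_vram + donated

print(f"free VRAM {free_vram:.1f} GB / total {total_vram:.1f} GB, "
      f"donated RAM {donated:.1f} GB, effective model budget {budget:.1f} GB")
```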
Okay, what's next?
Next, if you really want to work faster with GGUF and don't fear a ComfyUI crash, you can try installing SageAttention: https://github.com/thu-ml/SageAttention/tree/main
Fear? Because the 20xx series can have poor support for SageAttention, and you can get confused by the versions of Triton, PyTorch and CUDA.
For me it gives about +15% speed for all GGUF models, following this guide: https://civitai.com/articles/12848/step-by-step-guide-series-comfyui-installing-sageattention-2
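For reference, once it builds, SageAttention's Python API is a near drop-in for PyTorch's scaled_dot_product_attention (the linked guide covers wiring it into ComfyUI's launch options); a minimal sketch with toy tensors:

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn

# Toy attention inputs in (batch, heads, seq_len, head_dim) layout, FP16 as SageAttention expects.
q = torch.randn(1, 24, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

ref = F.scaled_dot_product_attention(q, k, v)                  # stock PyTorch attention
out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)  # SageAttention (INT8-quantized QK)

print((ref - out).abs().max())  # small numerical difference from the quantization, not bit-identical
```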
While I was writing this comment, I checked out the GitHub repo, and some people have successfully installed SageAttention 1 on 20xx:
https://github.com/thu-ml/SageAttention/issues/197
fork https://github.com/woct0rdho/SageAttention/issues/20
Do your own research, and good luck, have fun with the console :D
P.S. Make a backup of your nodes in Snapshot Manager (ComfyUI Manager, left side) :D You can also back up python_embed :DDD
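If you'd rather script that second part, a trivial copy along these lines works for a portable install (the paths are assumptions; the embedded-Python folder name varies by build):

```python
import shutil
from datetime import date
from pathlib import Path

# Assumed portable-install layout; adjust to wherever your ComfyUI actually lives.
root = Path(r"C:\ComfyUI_windows_portable")
backup = root / f"backup_{date.today().isoformat()}"

for name in ("python_embeded", "ComfyUI/custom_nodes"):  # folder names may differ on your install
    src = root / name
    if src.exists():
        shutil.copytree(src, backup / src.name, dirs_exist_ok=True)
        print("backed up", src)
```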
I don't understand why this information is now missing from GitHub, but I once shared this screenshot with a friend who has a 20xx card: