Memory question
What is the system memory usage when using offloading? I'm running your example Python script unmodified with CPU offloading, and my 24 GB VRAM AND 64 GB system RAM both fill completely, resulting in an OOM.
What's the error message?
Can you also try with:

```python
DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",
    device="cpu",
    cpu_offload=args.cpu_offload,
    pin_memory=False,
    bfloat16_model=transformer,
)
```
Thanks, and great project by the way, I'm quite curious how this could apply to video models.
But it looks like it's troubleshooting time. For reference, I'm on Win11 with a 4090, the Python env is 3.12.7, I installed the latest stable torch+cu128, then installed packages as described on the card.
With pin_memory=False, it starts to fill up the GPU, then offloads almost entirely to RAM, leaving only ~3 GB in VRAM while generating an image, which is of course quite slow.
Otherwise, without setting that, the OOM message looks like this (disregard my poor folder name choice):
```
The config attributes {'pooled_projection_dim': 768} were passed to QwenImageTransformer2DModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Loading DFloat11 safetensors (offloaded to CPU, memory pinned):   0%|          | 0/1 [00:41<?, ?it/s]
Traceback (most recent call last):
  File "H:\AI\Qwen-Image\qwen_image.py", line 39, in <module>
    DFloat11Model.from_pretrained(
  File "H:\AI\Qwen-Image\venv\Lib\site-packages\dfloat11\dfloat11.py", line 383, in from_pretrained
    load_and_replace_tensors(model, dfloat11_model_path, dfloat11_config, cpu_offload=cpu_offload, pin_memory=pin_memory)
  File "H:\AI\Qwen-Image\venv\Lib\site-packages\dfloat11\dfloat11.py", line 243, in load_and_replace_tensors
    module.offloaded_tensors[parts[-1]] = tensor_value.pin_memory() if pin_memory else tensor_value
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Thank you for the support!
DFloat11 is model-agnostic, and can be applied to any bfloat16 model. It works by applying entropy coding (specifically Huffman coding) on the exponent bits of model weights, without changing the underlying architecture. The compression is lossless, meaning you are getting exactly the same outputs as the original model.
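To make the idea concrete, here is a toy sketch (not the actual DFloat11 code, which packs and decompresses with custom GPU kernels; random values stand in for real weights here) showing why the exponent bits are so compressible, measured by their empirical entropy:

```python
# Toy illustration only - not the DFloat11 implementation.
# Measures the empirical entropy of bfloat16 exponent bits; Huffman coding
# approaches this entropy, which is why the exponents compress so well.
import math
from collections import Counter

import torch

weights = torch.randn(1_000_000).to(torch.bfloat16)  # stand-in for real model weights

# bfloat16 layout: 1 sign bit | 8 exponent bits | 7 mantissa bits
raw = weights.view(torch.uint16).int()
exponents = ((raw >> 7) & 0xFF).tolist()

counts = Counter(exponents)
total = len(exponents)
entropy = -sum(c / total * math.log2(c / total) for c in counts.values())

# Sign and mantissa bits are stored as-is; only the exponent is entropy-coded.
print(f"exponent entropy: {entropy:.2f} bits -> roughly {1 + entropy + 7:.1f} bits per weight")
```

Trained bfloat16 weights have a very skewed exponent distribution, so the entropy comes out far below 8 bits; that is where the roughly 11 bits per weight reflected in the name DFloat11 come from.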
It sounds like setting `pin_memory=False` resolved the issue for you. Memory-pinning is a technique that makes CPU-GPU transfer significantly faster, but it unfortunately consumes a lot of CPU memory. If memory-pinning results in OOM, then setting it to false would currently be the best option.
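As a rough illustration of the trade-off (the tensor size and timings below are arbitrary, not taken from the Qwen-Image pipeline), pinned memory speeds up host-to-device copies, but every pinned tensor occupies page-locked RAM that the OS cannot swap out:

```python
# Rough illustration of why pinned memory helps transfer speed.
# The ~1 GB tensor size is arbitrary; real offloaded weights are much larger.
import time

import torch

pageable = torch.empty(256, 1024, 1024)             # ordinary pageable CPU memory
pinned = torch.empty(256, 1024, 1024).pin_memory()  # page-locked (pinned) CPU memory

def copy_time(t: torch.Tensor) -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    t.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    return time.perf_counter() - start

print(f"pageable copy: {copy_time(pageable):.3f} s")
print(f"pinned copy:   {copy_time(pinned):.3f} s")
```

Because pinned pages cannot be swapped, pinning tens of GB of compressed weights can exhaust system RAM, which is consistent with the OOM you saw during loading.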
Well, mostly solved, as it's barely using the gpu, but thanks again.
Facing the same issue. There's an initial load up of the model into VRAM (almost 16 GB), but then most of it is offloaded to the RAM, leaving only ~3GB filled in the VRAM. Hardly any speedup.
I was unable to use this. I have an RTX 5090 and it would load the entire thing into VRAM (~31GB) but then it just got stuck. Latest CUDA and all :/
Yes, disabling pin_memory makes it work, but then it is super slow. Video generation models are faster on the same system (less than 10 mins). I'm pretty sure what you have built should work, but maybe you need to update that sample script. Here's what's happening:
- RTX 3090 - using 2.5 GB VRAM (10% of available)
- System RAM usage - close to 40 GB
- Time - over 20 mins for 50 steps

If I don't disable pinning, I get an OOM fairly early when the pipeline is being loaded. It may be that something is being loaded incorrectly in your sample script; there is no reason for an OOM. It's beyond my understanding to fix this.
Thanks for the work and sharing
Thank you for your contribution!
I have the same problem as @void-mckenzie: "Facing the same issue. There's an initial load up of the model into VRAM (almost 16 GB), but then most of it is offloaded to the RAM, leaving only ~3GB filled in the VRAM. Hardly any speedup." It takes 10 seconds for each step, but only 3 GB of VRAM is used.
I have the same problem, 24 GB VRAM and 64 GB RAM.
- `python qwen-image.py --cpu_offload --no_pin_memory`: this works but only uses ~10% of VRAM (2784 MB)
- `python qwen-image.py --cpu_offload`: this OOMs after using ~55 GB
- `python qwen-image.py`: not enough VRAM to run
I also tried to force the script to use `cudaMallocManaged`, but as expected that did not work.
Is there any way to make this run faster or make memory pinning use less memory?
The model is getting decompressed on the fly, so I believe it will require a lot more RAM. I might be wrong, but it's packing the whole transformer + text_encoder + others, and the two big ones are already above 60 GB. I'll have to test on something like 96 or 128 GB, which is still way better than using a GPU of that size. But I believe you need around 60 GB+ for the model alone, so 64 GB is definitely going to OOM.
Question for author:
After trying different methods, I wanted to ask: is there a way to selectively pin only the text encoder or only the transformer, since everything isn't going to fit in memory? It's still better than the default memory-management tweaks, especially since it's lossless compression.
I answered my own question about decompression: "No CPU decompression or host-device data transfer: all operations are handled entirely on the GPU."
So that would mean we can't just increase VRAM usage. Maybe the usage is not being correctly reported.
Thank you for everyone's feedback!
I have added a feature to the DFloat11 package for configuring the number of blocks to offload, which means
- offloading more blocks uses less GPU memory and more CPU memory,
- offloading fewer blocks uses more GPU memory and less CPU memory, and could be faster.

This will allow you to configure the optimal number of blocks to offload for the best balance between memory efficiency and speed. To try it, upgrade to the latest pip version (`pip install -U dfloat11[cuda12]`) and follow the instructions in the model README.
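As a rough sketch of what this looks like in the example script (the `cpu_offload_blocks` argument name below is a placeholder; the exact argument and recommended values are in the model README):

```python
# Sketch only: `cpu_offload_blocks` is a placeholder for the new option that
# controls how many transformer blocks are offloaded to CPU memory; check the
# model README for the exact argument name.
DFloat11Model.from_pretrained(
    "DFloat11/Qwen-Image-DF11",
    device="cpu",
    cpu_offload=True,
    pin_memory=False,
    cpu_offload_blocks=30,   # fewer offloaded blocks -> more VRAM used, faster
    bfloat16_model=transformer,
)
```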
Thank you so much.