DFloat11 Compressed Model: black-forest-labs/FLUX.1-Kontext-dev
This is a DFloat11 losslessly compressed version of the original black-forest-labs/FLUX.1-Kontext-dev
model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.
Thanks to DFloat11 compression, FLUX.1-Kontext-dev can now run smoothly on a single 24GB GPU without any quality loss.
Performance Comparison
Metric | FLUX.1-Kontext-dev (BFloat16) | FLUX.1-Kontext-dev (DFloat11) |
---|---|---|
Model Size | 23.80 GB | 16.33 GB |
Peak GPU Memory (1024×1024 image generation) | 24.86 GB | 18.12 GB |
Generation Time (A100 GPU) | 72 seconds | 83 seconds |
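To make the trade-off in the table easier to read, the short sketch below derives the relative changes from the reported figures. The variable and function names are illustrative only, and the numbers are copied from the table rather than measured here.

```python
# Figures copied from the comparison table above (not re-measured).
bf16 = {"size_gb": 23.80, "peak_mem_gb": 24.86, "time_s": 72.0}
df11 = {"size_gb": 16.33, "peak_mem_gb": 18.12, "time_s": 83.0}


def percent_change(before: float, after: float) -> float:
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


size_change = percent_change(bf16["size_gb"], df11["size_gb"])
mem_change = percent_change(bf16["peak_mem_gb"], df11["peak_mem_gb"])
time_change = percent_change(bf16["time_s"], df11["time_s"])

print(f"Model size:       {size_change:+.1f}%")
print(f"Peak GPU memory:  {mem_change:+.1f}%")
print(f"Generation time:  {time_change:+.1f}%")
```

With these figures, model size drops by about 31%, peak GPU memory drops by about 27%, and generation time rises by about 15%.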
How to Use
Install or upgrade the DFloat11 pip package (installs the CUDA kernel automatically; requires a CUDA-compatible GPU and PyTorch installed):
pip install -U dfloat11[cuda12]
# or if you have CUDA version 11:
# pip install -U dfloat11[cuda11]
Install diffusers from the main branch (required until a future stable release):
pip install git+https://github.com/huggingface/diffusers.git
To use the DFloat11 model, run the following example code in Python:
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image
from dfloat11 import DFloat11Model

# Load the original pipeline in BFloat16.
pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)

# Load the DFloat11-compressed weights into the pipeline's transformer.
DFloat11Model.from_pretrained(
    "DFloat11/FLUX.1-Kontext-dev-DF11",
    device="cpu",
    bfloat16_model=pipe.transformer,
)

# Offload pipeline components to the CPU when they are not in use.
pipe.enable_model_cpu_offload()

input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")

image = pipe(
    image=input_image,
    prompt="Add a hat to the cat",
    guidance_scale=2.5,
).images[0]

image.save("kontext.png")
How It Works
We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
The result is a model that is ~32% smaller and delivers bit-identical outputs, at the cost of somewhat longer generation time (83 vs. 72 seconds on an A100 in the comparison above).
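The exponent-compression idea can be illustrated with a small, CPU-only sketch. It is not the DFloat11 implementation: it uses synthetic Gaussian values as a stand-in for real model weights (an assumption), truncates float32 to BFloat16 rather than rounding, and only computes Huffman code lengths instead of producing a packed bitstream or a GPU decoder. The helper names `bf16_exponent` and `huffman_code_lengths` are hypothetical.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncation to BFloat16."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    bits16 = bits32 >> 16        # BFloat16 keeps the top 16 bits of float32
    return (bits16 >> 7) & 0xFF  # layout: 1 sign | 8 exponent | 7 mantissa


def huffman_code_lengths(freqs: Counter) -> dict:
    """Huffman code length (in bits) for each symbol in freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (count, unique tie-breaker, symbols in this subtree).
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:      # each merge adds one bit to these codes
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


# Assumption: zero-mean Gaussian values stand in for trained weights.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]

freqs = Counter(bf16_exponent(w) for w in weights)
total = sum(freqs.values())

# Shannon entropy of the exponent field, in bits per weight.
entropy = -sum(c / total * math.log2(c / total) for c in freqs.values())

lengths = huffman_code_lengths(freqs)
avg_exp_bits = sum(freqs[s] * lengths[s] for s in freqs) / total

# Sign (1 bit) and mantissa (7 bits) are stored uncompressed in this sketch.
bits_per_weight = 1 + 7 + avg_exp_bits

print(f"distinct exponent values: {len(freqs)}")
print(f"exponent entropy:         {entropy:.2f} bits")
print(f"Huffman exponent bits:    {avg_exp_bits:.2f} bits")
print(f"bits per weight:          {bits_per_weight:.2f} (vs. 16 for BFloat16)")
```

On this synthetic data the exponent field needs far fewer than 8 bits on average, which is the effect the compression scheme exploits. Real checkpoints will give somewhat different numbers.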
Learn more in our research paper.