DFloat11 Compressed Model: black-forest-labs/FLUX.1-Kontext-dev

This is a DFloat11 losslessly compressed version of the original black-forest-labs/FLUX.1-Kontext-dev model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

πŸ”₯πŸ”₯πŸ”₯ Thanks to DFloat11 compression, FLUX.1-Kontext-dev can now run smoothly on a single 24GB GPU without any quality loss. πŸ”₯πŸ”₯πŸ”₯

πŸ“Š Performance Comparison

| Metric | FLUX.1-Kontext-dev (BFloat16) | FLUX.1-Kontext-dev (DFloat11) |
|---|---|---|
| Model Size | 23.80 GB | 16.33 GB |
| Peak GPU Memory (1024Γ—1024 image generation) | 24.86 GB | 18.12 GB |
| Generation Time (A100 GPU) | 72 seconds | 83 seconds |
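
If you want to check whether your GPU fits these numbers before installing, the snippet below is a minimal sketch (assuming PyTorch with CUDA support is already installed). It reports the detected CUDA version, which also indicates whether to pick the cuda11 or cuda12 extra in the next section, and compares total GPU memory against the ~18.12 GB peak from the table:

    import torch

    # Minimal environment check (assumes PyTorch was built with CUDA support).
    assert torch.cuda.is_available(), "A CUDA-compatible GPU is required."

    # The CUDA version indicates which dfloat11 extra to install (cuda11 vs cuda12).
    print(f"PyTorch CUDA version: {torch.version.cuda}")

    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, total memory: {total_gb:.2f} GB")

    # Peak usage for 1024x1024 generation with DFloat11 is ~18.12 GB (see table above).
    print("Fits the DFloat11 peak (~18.12 GB):", total_gb >= 18.12)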

πŸ”§ How to Use

  1. Install or upgrade the DFloat11 pip package (this installs the CUDA kernel automatically; it requires a CUDA-compatible GPU and an existing PyTorch installation):

    pip install -U dfloat11[cuda12]
    # or if you have CUDA version 11:
    # pip install -U dfloat11[cuda11]
    
  2. Install diffusers from the main branch (required until support is included in a stable release):

    pip install git+https://github.com/huggingface/diffusers.git
    
  3. To use the DFloat11 model, run the following example code in Python:

    import torch
    from diffusers import FluxKontextPipeline
    from diffusers.utils import load_image
    from dfloat11 import DFloat11Model
    
    # Load the full pipeline in BFloat16, then swap the transformer's weights
    # for the DFloat11-compressed version hosted at DFloat11/FLUX.1-Kontext-dev-DF11.
    pipe = FluxKontextPipeline.from_pretrained("black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16)
    DFloat11Model.from_pretrained(
        "DFloat11/FLUX.1-Kontext-dev-DF11",
        device="cpu",
        bfloat16_model=pipe.transformer,
    )
    # Keep components on CPU and move them to the GPU only when they are used,
    # so peak GPU memory stays within a single 24GB card.
    pipe.enable_model_cpu_offload()
    
    input_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
    
    image = pipe(
        image=input_image,
        prompt="Add a hat to the cat",
        guidance_scale=2.5,
    ).images[0]
    
    image.save("kontext.png")
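
To reproduce numbers like those in the performance table, you can wrap the pipeline call with simple timing and peak-memory tracking. The sketch below reuses the pipe and input_image objects from the example above; absolute numbers will vary with your GPU and settings:

    import time
    import torch

    # Reset the peak-memory counter, then time a single image generation.
    torch.cuda.reset_peak_memory_stats()
    start = time.time()

    image = pipe(
        image=input_image,
        prompt="Add a hat to the cat",
        guidance_scale=2.5,
    ).images[0]

    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Generation time: {elapsed:.1f} s, peak GPU memory: {peak_gb:.2f} GB")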
    

πŸ” How It Works

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
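
As a rough illustration of why the exponent bits compress so well, the sketch below estimates the empirical Shannon entropy of the exponent field of a BFloat16 tensor. It uses randomly initialized values as a stand-in for real trained weights, so the exact figure will differ from the ~2.6 bits observed on actual model weights:

    import torch

    # Stand-in for trained weights; real checkpoints show roughly 2.6 bits of entropy.
    w = torch.randn(1_000_000, dtype=torch.bfloat16)

    # Reinterpret the 16-bit pattern; bits 14-7 hold the 8-bit exponent field.
    bits = w.view(torch.int16).to(torch.int32) & 0xFFFF
    exponents = ((bits >> 7) & 0xFF).long()

    # Shannon entropy H = -sum(p * log2(p)) over the 256 possible exponent values.
    counts = torch.bincount(exponents, minlength=256).float()
    probs = counts[counts > 0] / counts.sum()
    entropy = -(probs * probs.log2()).sum().item()

    print(f"Empirical entropy of the exponent bits: {entropy:.2f} / 8 bits")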

The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Learn more in our research paper.
