DFloat11 Compressed Model: black-forest-labs/FLUX.1-Krea-dev

This is a DFloat11 losslessly compressed version of the original black-forest-labs/FLUX.1-Krea-dev model. It reduces model size by 32% compared to the original BFloat16 model, while maintaining bit-identical outputs and supporting efficient GPU inference.

🔥🔥🔥 Thanks to DFloat11 compression, FLUX.1-Krea-dev can now run on a single 24GB GPU, or on a 12GB GPU with CPU offloading, while maintaining full model quality. 🔥🔥🔥

📊 Performance Comparison

| Model | Model Size | Peak GPU Memory (1024×1024 image generation) | Generation Time (A100 GPU) |
|---|---|---|---|
| FLUX.1-Krea-dev (BFloat16) | 23.80 GB | 24.28 GB | 56 seconds |
| FLUX.1-Krea-dev (DFloat11) | 16.33 GB | 17.54 GB | 58 seconds |
| FLUX.1-Krea-dev (DFloat11 + CPU Offloading) | 16.33 GB | 9.76 GB | 78 seconds |
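
If you are unsure which mode fits your hardware, you can compare free VRAM against the peaks above. A minimal sketch using PyTorch's torch.cuda.mem_get_info (the thresholds are the table's peak-memory figures, rounded up; treat them as a rough guide, not hard limits):

    import torch

    # Compare free VRAM against the documented peaks for 1024×1024 generation
    free_bytes, _ = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1000**3
    if free_gb >= 18:
        print(f"{free_gb:.1f} GB free: DFloat11 fits without CPU offloading")
    elif free_gb >= 10:
        print(f"{free_gb:.1f} GB free: use CPU offloading (see below)")
    else:
        print(f"{free_gb:.1f} GB free: below the documented requirements")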

🔧 How to Use

  1. Install or upgrade the DFloat11 pip package (this installs the CUDA decompression kernel automatically; a CUDA-compatible GPU and an existing PyTorch installation are required):

    pip install -U dfloat11[cuda12]
    
  2. Install or upgrade diffusers (the script below also uses diffusers' enable_model_cpu_offload, which requires accelerate):

    pip install -U diffusers accelerate
    
  3. Save the following code to a Python file named krea.py:

    import argparse
    import time
    
    import torch
    from diffusers import FluxPipeline
    from dfloat11 import DFloat11Model
    
    # Parse command line arguments
    parser = argparse.ArgumentParser(description="Generate images using FLUX.1-Krea-dev model")
    parser.add_argument(
        "--prompt", type=str, help="Text prompt for image generation",
        default="An astronaut, helmet off, sits at a tiny table set on the tip of a crescent moon, sipping tea while gazing at a swirling galaxy in the distance. Stars twinkle around, casting a gentle glow on the lunar surface.",
    )
    parser.add_argument("--width", type=int, default=1024, help="Image width")
    parser.add_argument("--height", type=int, default=1024, help="Image height")
    parser.add_argument("--guidance_scale", type=float, default=4.5, help="Guidance scale for generation")
    parser.add_argument("--save_file_name", type=str, default="flux-krea-dev.png", help="Output file name")
    parser.add_argument("--cpu_offload", action="store_true", help="Enable DFloat11 CPU offloading")
    args = parser.parse_args()
    
    # Load the pipeline
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Krea-dev",
        torch_dtype=torch.bfloat16,
    )
    
    # Load the DFloat11 compressed weights into the pipeline's transformer in place
    DFloat11Model.from_pretrained(
        "DFloat11/FLUX.1-Krea-dev-DF11",
        bfloat16_model=pipe.transformer,
        device="cpu",
        cpu_offload=args.cpu_offload,
    )
    # Keep idle pipeline components on the CPU to reduce peak GPU memory
    pipe.enable_model_cpu_offload()
    
    start_time = time.time()
    # Generate image
    image = pipe(
        args.prompt,
        height=args.height,
        width=args.width,
        guidance_scale=args.guidance_scale,
    ).images[0]
    end_time = time.time()
    
    # Save the image
    image.save(args.save_file_name)
    
    # Print time and memory usage
    print(f"Time taken: {end_time - start_time:.2f} seconds")
    peak_memory = torch.cuda.max_memory_allocated()
    print(f"Peak memory: {peak_memory / 1000 ** 3:.2f} GB")
    
  4. To run without CPU offloading (about 18 GB of VRAM required):

    python krea.py
    

     To run with CPU offloading (about 10 GB of VRAM required):

    python krea.py --cpu_offload
    

     For reproducible results, see the seeding tip after these steps.

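Tip: the script samples a different image on each run. For reproducible outputs, pass a seeded torch.Generator to the pipeline call, a standard diffusers option. The snippet below is a drop-in replacement for the pipe(...) call in krea.py (the seed value is arbitrary):

    # Reproducible generation: fix the sampler's random seed
    generator = torch.Generator("cpu").manual_seed(42)
    image = pipe(
        args.prompt,
        height=args.height,
        width=args.width,
        guidance_scale=args.guidance_scale,
        generator=generator,
    ).images[0]
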
πŸ” How It Works

We apply Huffman coding to losslessly compress the exponent bits of BFloat16 model weights, which are highly compressible (their 8 bits carry only ~2.6 bits of actual information). To enable fast inference, we implement a highly efficient CUDA kernel that performs on-the-fly weight decompression directly on the GPU.
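
The compressibility claim is easy to sanity-check. Below is a toy, self-contained sketch (not the DFloat11 kernel) that measures the empirical entropy of the BFloat16 exponent field; substituting a real weight matrix for the stand-in tensor reproduces the ~2.6 bits cited above:

    import torch

    def exponent_entropy(weights: torch.Tensor) -> float:
        """Empirical Shannon entropy (bits) of the 8-bit BF16 exponent field."""
        # BF16 bit layout: 1 sign bit, 8 exponent bits, 7 mantissa bits
        bits = weights.to(torch.bfloat16).view(torch.int16).to(torch.int32) & 0xFFFF
        exponents = (bits >> 7) & 0xFF
        counts = torch.bincount(exponents.flatten(), minlength=256).float()
        probs = counts[counts > 0] / counts.sum()
        return -(probs * probs.log2()).sum().item()

    w = torch.randn(1_000_000, dtype=torch.bfloat16)  # stand-in for a weight matrix
    print(f"exponent entropy: {exponent_entropy(w):.2f} of 8 bits")
    # Back-of-the-envelope: 1 sign + ~2.6 exponent + 7 mantissa ≈ 10.6 of 16
    # bits per weight, which matches the ~32% size reduction reported here.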

The result is a model that is ~32% smaller, delivers bit-identical outputs, and achieves performance comparable to the original BFloat16 model.

Learn more in our research paper.
