🔥 Announcing FLUX-Juiced: The Fastest Image Generation Endpoint (2.6 times faster)!

Community Article · Published April 23, 2025


Over the past few years, image generation models have become incredibly powerful—but also incredibly slow. Models like FLUX.1, with their massive architectures, regularly take 6+ seconds per image, even on cutting-edge GPUs like the H100. While compression is a widely applied technique that can reduce inference time, its impact on quality often remains unclear, so we decided to do two things.

  1. We built FLUX-juiced—the fastest FLUX.1 endpoint available today. And it’s now live on Replicate. 🚀
  2. We created InferBench by running a comprehensive benchmark comparing FLUX-juiced with the “FLUX.1 [dev]” endpoints offered by different inference providers like Replicate, Fal, Fireworks, and Together! And it’s now live on Hugging Face 🤗.

⚡ What’s FLUX-juiced?

FLUX-juiced is our optimized version of FLUX.1, delivering up to 2.6x faster inference than the official Replicate API, without sacrificing image quality.

Under the hood, it uses a custom combination of:

  • Graph compilation for optimized execution paths
  • Inference-time caching for repeated operations

We won’t go deep into the internals here, but here’s the gist:

We combine compiler-level execution graph optimization with selective caching of heavy operations (like attention layers), allowing inference to skip redundant computations without any loss in fidelity.

These techniques are generalized and plug-and-play via the Pruna Pro pipeline, and can be applied to nearly any diffusion-based image model—not just FLUX. For a free but still very juicy model, you can use our open-source solution.
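To make the gist more concrete, here is a rough sketch of the two ideas using plain PyTorch and Diffusers: torch.compile stands in for graph compilation, and a naive every-other-step output cache stands in for inference-time caching. This is an illustration under our own simplifications, not the Pruna Pro implementation.

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Graph compilation: trace the transformer (the heavy part) into an
# optimized execution graph.
compiled_forward = torch.compile(pipe.transformer.forward, mode="max-autotune")

# Inference-time caching (naive version): recompute the transformer output
# on every other denoising step and reuse the cached output in between.
state = {"step": 0, "cached": None}

def cached_forward(*args, **kwargs):
    if state["cached"] is None or state["step"] % 2 == 0:
        state["cached"] = compiled_forward(*args, **kwargs)
    state["step"] += 1
    return state["cached"]

pipe.transformer.forward = cached_forward

image = pipe("A cute, round, knitted purple prune.", num_inference_steps=28).images[0]

The real cacher is more selective about which heavy operations it reuses (see the taylor_auto cacher in the snippet at the end of this post), whereas this toy version blindly alternates steps.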

🧪 Try FLUX-juiced now → replicate.com/prunaai/flux.1-juiced

📊 InferBench: Comparing Speed, Cost and Quality

To back it up, we ran a comprehensive benchmark comparing FLUX-juiced with the “FLUX.1 [dev]” endpoints offered by:

  • Replicate
  • Fal
  • Fireworks
  • Together

All of these inference providers offer FLUX.1 [dev] implementations, but they don’t always communicate about the optimization methods used in the background, and most endpoints have different response times and performance measures.

For comparison purposes, we used the same generation configuration and hardware across the different providers:

  • 28 inference steps
  • 1024×1024 resolution
  • Guidance scale of 3.5
  • H100 GPU (80GB)—only reported by Replicate

Although we tested with this configuration and hardware, the applied compression methods work with other configurations and hardware too!
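In Diffusers terms, that benchmark configuration corresponds to a call roughly like the one below (assuming pipe is a loaded FluxPipeline, as in the sketch above, and prompt is one of the benchmark prompts):

image = pipe(
    prompt,
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=3.5,
).images[0]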

The full results of this benchmark are published in the InferBench Space on Hugging Face 🤗; this post walks you through the key findings.

🗂️ Datasets & Benchmarks

To make a fair comparison, we used datasets and benchmarks containing prompts, images, and human annotations designed to evaluate the capabilities of text-to-image generation models in a standardized way.

Name | Description | Source
--- | --- | ---
DrawBench | A comprehensive set of prompts designed to evaluate text-to-image models across various capabilities, including rendering colors, object counts, spatial relations, and scene text. | Dataset, Paper
HPSv2 | A large-scale dataset capturing human preferences on images from diverse sources, aimed at evaluating the alignment of generated images with human judgments. | GitHub, Paper
GenAI-Bench | A benchmark designed to assess multimodal large language models’ ability to judge AI-generated content quality by comparing model evaluations with human preferences. | Dataset, GitHub, Website, Paper
GenEval | An object-focused framework for evaluating text-to-image alignment using existing object detection methods to produce fine-grained instance-level analysis. | GitHub, Paper
PartiPrompts | A rich set of over 1,600 English prompts designed to measure model capabilities across various categories and challenge aspects. | Dataset, GitHub, Website, Paper
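As an illustration of how such prompt sets are consumed (a sketch, not our exact benchmark harness), prompts can be pulled with the datasets library and run through a pipeline. The Hub id and column name used here ("nateraw/parti-prompts", "Prompt") are assumptions; check the Hub for the exact identifiers.

from datasets import load_dataset

# Hypothetical Hub id and column name for PartiPrompts; verify before use.
parti = load_dataset("nateraw/parti-prompts", split="train")

images = []
for row in parti.select(range(8)):  # small sample, for illustration only
    images.append(pipe(row["Prompt"], num_inference_steps=28).images[0])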

📐 Metrics and scoring

Along with the datasets, we used various scoring measures to quantitatively and objectively assess the performance of text-to-image generation models across aspects such as image quality, relevance to the prompt, and alignment with human preferences.

Name | Description | Source
--- | --- | ---
ImageReward | A reward model trained on 137k human preference comparisons to evaluate text-to-image generation quality. It serves as an automatic metric for assessing synthesis quality. | Paper, GitHub
VQA-Score (model='clip-flant5-xxl') | A vision-language generative model fine-tuned for image-text retrieval tasks, providing scores that reflect the alignment between images and textual descriptions. | GitHub, Paper
CLIP-IQA | An image quality assessment metric based on the CLIP model, measuring visual content quality by calculating the cosine similarity between images and predefined prompts. | Paper, TorchMetrics
CLIP-Score | A reference-free metric evaluating the correlation between generated captions and image content, as well as similarity between texts or images, using the CLIP model. | Paper, TorchMetrics
CMMD | CLIP Maximum Mean Discrepancy (CMMD) is used to evaluate image generation models. CMMD stands out as a better metric than FID and tries to mitigate its longstanding issues. | GitHub, Paper
ARNIQA | A no-reference image quality assessment metric that predicts the technical quality of an image with high correlation to human judgments, included in the TorchMetrics library. | Paper, TorchMetrics
Sharpness (Variance of Laplacian) | A method to quantify image sharpness by calculating the variance of the Laplacian, where higher variance indicates a sharper image. | Blogpost, Blogpost 2
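To give a concrete feel for the scoring (again a sketch, not our benchmark code), CLIP-Score via TorchMetrics and the variance-of-Laplacian sharpness can be computed for a single generated image as follows; the image path is a placeholder:

import cv2
import numpy as np
import torch
from PIL import Image
from torchmetrics.multimodal.clip_score import CLIPScore

prompt = "A cute, round, knitted purple prune."
image = Image.open("generated_image.png")  # placeholder path to a generated image

# CLIP-Score: similarity between CLIP embeddings of the image and the prompt.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
image_tensor = torch.from_numpy(np.array(image)).permute(2, 0, 1)  # HWC -> CHW, uint8
print("CLIP-Score:", clip_score(image_tensor, prompt).item())

# Sharpness: variance of the Laplacian of the grayscale image
# (higher variance indicates a sharper image).
gray = np.array(image.convert("L"))
print("Sharpness:", cv2.Laplacian(gray, cv2.CV_64F).var())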

🏆 Results

🕷️ Comparison overview

As discussed above, we evaluated the models across the various benchmarks and metrics. The figure below gives a comparison overview of our findings.

[Figure: comparison overview across benchmarks and metrics]

FLUX-juiced clearly leads the pack! While it achieves comparable quality to the other endpoints, it is much more efficient: it can generate 180 images per dollar and takes only 2.5 seconds per image, while the base model needs about 6 seconds.

🏎️ Speed Comparison

Most compression techniques trade off inference speed against quality. To find out where the FLUX-juiced endpoint lies within this tradeoff, we plotted different quality dimensions against the measured inference speed.

[Figure: quality vs. inference speed]

The various FLUX-juiced versions form the Pareto front: at any given latency no other API delivers higher quality, and at any given quality level no other API is faster.
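Put differently, an endpoint sits on the Pareto front if no other endpoint is at least as fast and at least as good on quality while being strictly better on one of the two. A minimal helper to compute this from (latency, quality) pairs, purely for illustration with made-up numbers:

def pareto_front(points):
    # points: list of (latency_seconds, quality_score) tuples.
    # A point is dominated if another point has latency <= its latency and
    # quality >= its quality, with at least one strict inequality.
    front = []
    for i, (lat_i, q_i) in enumerate(points):
        dominated = any(
            j != i and lat_j <= lat_i and q_j >= q_i and (lat_j < lat_i or q_j > q_i)
            for j, (lat_j, q_j) in enumerate(points)
        )
        if not dominated:
            front.append((lat_i, q_i))
    return front

print(pareto_front([(2.5, 0.97), (6.0, 1.0), (4.0, 0.90)]))  # -> [(2.5, 0.97), (6.0, 1.0)]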

FLUX-extra-juiced takes only 2.5 seconds per image compared to the baseline’s 7 seconds, a speedup that becomes significant at scale: saving 4.5 seconds on each of 1 million images works out to roughly 1,250 hours of compute runtime.

💸 Cost Comparison

Generating images at scale using these APIs is not cheap: most require about $25,000 to produce 1M images. Therefore, we also looked at quality versus cost.

[Figure: quality vs. cost]

The result? FLUX-juiced consistently sits on the Pareto front, delivering top-tier quality at top-tier speed for a fraction of the price. At 180 images per dollar, 1M images come in at roughly $5,500 instead of about $25,000, so you could save around $20,000 in costs.

🖼️ Side-by-Side Comparison

We’ve also created a website that compares the outputs of FLUX-juiced with those of the baseline across 600 prompts. Take a look for yourself.

[Image: example side-by-side comparison]

🧃 Get a Juiced Version of Your Model

Using our Pruna Pro engine, we created FLUX-juiced with just 10 lines of code. And you can do it too, as this snippet works for almost every 🤗 Diffusers pipeline.

import torch
from diffusers import FluxPipeline
from pruna_pro import SmashConfig, smash

# Load the base FLUX.1 [dev] pipeline in bfloat16
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Configure graph compilation and inference-time caching
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["cacher"] = "taylor_auto"
# lightly juiced (0.6), juiced (0.5), extra juiced (0.4)
smash_config["taylor_auto_speed_factor"] = 0.4
smash_token = "<your-token>"
# Apply the configured optimizations to the pipeline
smashed_pipe = smash(
    model=pipe,
    token=smash_token,
    smash_config=smash_config,
)

# Generate an image with the optimized pipeline
smashed_pipe("A cute, round, knitted purple prune.").images[0]

⏭️ What’s Next?

We have already been releasing public models on Hugging Face, and we plan on releasing many more across inference providers like Replicate! We will not only focus on image generation but will likely tackle other modalities as well, implementing the latest and greatest optimization techniques!

Ready to make your own models go faster?!

Let us know what models you want to see juiced next. And if you’re building with diffusion models—we’d love to hear from you.

Stay juicy! We know we will.

