AI & ML interests

None defined yet.

a-r-r-o-w 
posted an update 2 days ago
view post
Post
1728
You would've implemented the 3-loop matrix multiplication many times as a ML practitioner, but the naive implementation is terrible for GPU performance. Modern GPUs achieve peak performance through careful memory access patterns and minimizing scheduling overhead.

In naive matmul (MxK . KxN), the computation happens in tiles - both for the output matrix and for how you read chunks from the input matrices. Each thread-block processes one output tile by loading corresponding tiles from input (for sum-reduction across K dimension), performing the computation, then terminating. The GPU launches many thread-blocks and schedules them across available streaming multiprocessors (SMs). When an SM finishes one tile, it gets assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the cost for launching thread-blocks each time a new tile is computed.

Persistent matmul changes this approach. Instead of launching thread-blocks to compute some output tiles, computing the results on SMs in parallel, and repeating until all output tiles are computed, you launch only as many thread-blocks as you have SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, looping through multiple tiles sequentially. Each persistent thread-block may handle multiple output tiles.

The key benefit is the reduced thread-block launch latency. This persistence strategy, combined with other optimizations like coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling and other tricks, helps achieve peak performance. More on this in the future!

Code snippet for testing: https://gist.github.com/a-r-r-o-w/28339b442d164084506c0967029968a8

(Bonus: Since I've wanted to learn Manim for a while, this was a great opportunity to make a visualization for Naive VS Persistent matmul. Enjoy ✨)
  • 3 replies
·
sayakpaul 
posted an update 13 days ago
view post
Post
866
Fast LoRA inference for Flux with Diffusers and PEFT 🚨

There are great materials that demonstrate how to optimize inference for popular image generation models, such as Flux. However, very few cover how to serve LoRAs fast, despite LoRAs being an inseparable part of their adoption.

In our latest post, @BenjaminB and I show different techniques to optimize LoRA inference for the Flux family of models for image generation. Our recipe includes the use of:

1. torch.compile
2. Flash Attention 3 (when compatible)
3. Dynamic FP8 weight quantization (when compatible)
4. Hotswapping for avoiding recompilation during swapping new LoRAs 🤯

We have tested our recipe with Flux.1-Dev on both H100 and RTX 4090. We achieve at least a *2x speedup* in either of the GPUs. We believe our recipe is grounded in the reality of how LoRA-based use cases are generally served. So, we hope this will be beneficial to the community 🤗

Even though our recipe was tested primarily with NVIDIA GPUs, it should also work with AMD GPUs.

Learn the details and the full code here:
https://huggingface.co/blog/lora-fast
a-r-r-o-w 
posted an update 28 days ago
view post
Post
3256
Caching is an essential technique used in diffusion inference serving for speeding up image/video generations. Diffusers just added support for another caching method: First Block Cache - a technique developed by @chengzeyi building upon the ideas of TeaCache.

The idea in short is: if the model predictions do not vary much over successive inference steps, we can skip certain steps where the prediction difference is small. To figure out whether an inference step will make a significant improvement to the overall velocity/noise prediction, we calculate the relative difference of the output of the first transformer block at timestep $t$ with $t-1$, and compare it against a selected threshold. If the difference is lower than the threshold, we skip the step. A higher threshold will lead to more steps being skipped. However, skipping many steps is bad because it can throw off the model predictions, and so we need to test and select the threshold based on level of quality-speed tradeoff for every model we use it with.

Diffusers usage with CogView4:

import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")


Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache
  • 1 reply
·
a-r-r-o-w 
posted an update about 1 month ago
view post
Post
2832
As you might have already heard, FLUX.1-Kontext-dev is now released and taken the generative community by storm!

In case you haven't come across it, you can get started with Kontext using 🤗 diffusers. See the official [model]( black-forest-labs/FLUX.1-Kontext-dev) and [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#flux).

Want to know how inference companies like Fal & Replicate are able to run the model so fast and in under 2 seconds per image? Check out this [gist](https://gist.github.com/a-r-r-o-w/d08c37e8bd3e9c26b4ce80360be148c6) for some details!
  • 1 reply
·
multimodalart 
posted an update about 2 months ago
view post
Post
9076
Self-Forcing - a real-time video distilled model from Wan 2.1 by @adobe is out, and they open sourced it 🐐

I've built a live real time demo on Spaces 📹💨

multimodalart/self-forcing
·
a-r-r-o-w 
posted an update about 2 months ago
a-r-r-o-w 
posted an update about 2 months ago
view post
Post
1317
Did you know how simple it was to get started with your own custom compiler backend with torch.compile? What's stopping you from writing your own compiler?

import torch
from torch._functorch.partitioners import draw_graph

def compiler(fx_module: torch.fx.GraphModule, _):
    draw_graph(fx_module, f"compile.dot")
    return fx_module.forward

def capture(model, *inputs):
    compiled_model = torch.compile(model, backend=compiler)
    y = compiled_model(*inputs)
    y.sum().backward()

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        
        self.linear_1 = torch.nn.Linear(16, 32)
        self.linear_2 = torch.nn.Linear(32, 16)
    
    def forward(self, x):
        x = self.linear_1(x)
        x = torch.nn.functional.silu(x)
        x = self.linear_2(x)
        return x

if __name__ == '__main__':
    model = MLP()
    model.to("mps")
    x = torch.randn(4, 16, device="mps", dtype=torch.float32)

    capture(model, x)


--------------

Part of https://huggingface.co/posts/a-r-r-o-w/231008365980283
  • 1 reply
·
a-r-r-o-w 
posted an update about 2 months ago
view post
Post
2245
Recently, I've been focusing my learning on the following topics:
- Pytorch internals, specifically the inductor system (roughly ~1 month of experience)
- Triton internals (~8 moe)
- CUDA (~3 moe)
- Understanding fusion patterns in compilers and how to improve them (~1 moe)
- Parallelism strategies for large scale inference optimization (~6-7 moe)

I thought it would be nice to document it somewhere for no particular reason. Maybe someone will find it useful? It's also because I want to get into the habit of writing, but had no motivation to do so. Maybe writing short informal posts will help build the habit.

Since I don't have a personal site, and don't plan to create one in the near future, I think HF posts are best suited for short and informal documentation to share my little discoveries and learnings. If you're interested, strap in!

First post in this series will be on basic study of Pytorch's float32 matmuls and their Triton implementation (nothing much, just the tutorial available on the website), short dive into TF32 and their TFLOPS comparison on an A100 machine.
·
sayakpaul 
posted an update 3 months ago
view post
Post
2774
Diffusers supports a good variety of quantization backends. It can be challenging to navigate through them, given the complex nature of diffusion pipelines in general.

So, @derekl35 set out to write a comprehensive guide that puts users in the front seat. Explore the different backends we support, learn the trade-offs they offer, and finally, check out the cool space we built that lets you compare quantization results.

Give it a go here:
https://lnkd.in/gf8Pi4-2
sayakpaul 
posted an update 3 months ago
view post
Post
1745
Despite the emergence of combining LLM and DiT architectures for T2I synthesis, its design remains severely understudied.

This was done long ago and got into CVPR25 -- super excited to finally share it now, along with the data and code ♥️

We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.

Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.

Despite its compelling results and other performance virtues, it remains unexplored, which is what we want to improve in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and trainable DiT, and set out to explore what makes a "good deep fusion" between the two for T2I.

We explore several key questions in the work, such as:

Q1: How should we do attention? We considered several alternatives. PixArt-Alpha like attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?

Based on the above findings, we arrive at FuseDiT with the following components on top of the base architecture from the findings of our experiments.

* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly

We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While not the best model, it's encouraging to develop something in a guided manner using open datasets.

To know more (code, models, all are available), please check out the paper:
https://lnkd.in/gg6qyqZX.