Vchitect-XL

non-profit

AI & ML interests

None defined yet.

Recent Activity

a-r-r-o-w posted an update 25 days ago
Caching is an essential technique used in diffusion inference serving for speeding up image/video generation. Diffusers just added support for another caching method: First Block Cache, a technique developed by @chengzeyi building upon the ideas of TeaCache.

The idea, in short: if the model predictions do not vary much over successive inference steps, we can skip steps where the prediction difference is small. To decide whether an inference step will contribute significantly to the overall velocity/noise prediction, we compute the relative difference between the output of the first transformer block at timestep $t$ and at timestep $t-1$, and compare it against a chosen threshold. If the difference is below the threshold, the step is skipped. A higher threshold leads to more skipped steps, but skipping too many can throw off the model predictions, so the threshold has to be tuned per model according to the desired quality-speed tradeoff.
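
As a minimal sketch of the idea (my own illustration, not the actual Diffusers hook, which also has to cache and reuse the previous step's output), the skip decision could look something like this:

import torch

def should_skip_step(first_block_out: torch.Tensor,
                     prev_first_block_out: torch.Tensor,
                     threshold: float) -> bool:
    # Relative difference between the first transformer block's output at
    # timestep t and at timestep t-1
    diff = (first_block_out - prev_first_block_out).abs().mean()
    rel_diff = diff / prev_first_block_out.abs().mean().clamp(min=1e-8)
    # If the change is small, skip the remaining blocks for this step and
    # reuse the cached prediction instead
    return rel_diff.item() < threshold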

Diffusers usage with CogView4:

import torch
from diffusers import CogView4Pipeline
from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig

pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Enable First Block Cache on the transformer; a higher threshold skips more
# steps at the cost of some output quality
apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))

prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
image.save("output.png")


Below, you'll find the benchmarks and visualizations of the predicted output at different blocks of the Flux DiT.

Docs: https://huggingface.co/docs/diffusers/main/en/optimization/cache
PR: https://github.com/huggingface/diffusers/pull/11180

References:
- First Block Cache: https://github.com/chengzeyi/ParaAttention
- TeaCache: https://github.com/ali-vilab/TeaCache

a-r-r-o-w posted an update about 1 month ago
As you might have already heard, FLUX.1-Kontext-dev has been released and has taken the generative community by storm!

In case you haven't come across it, you can get started with Kontext using 🤗 Diffusers. See the official [model](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev) and [docs](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#flux).

Want to know how inference companies like Fal & Replicate are able to run the model so fast, in under 2 seconds per image? Check out this [gist](https://gist.github.com/a-r-r-o-w/d08c37e8bd3e9c26b4ce80360be148c6) for some details!
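
The gist has the actual recipe; purely as a hedged illustration (not taken from the gist), two common ingredients of fast Flux serving are half precision and compiling the transformer:

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# Load the pipeline in bfloat16 to cut memory and bandwidth costs
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# torch.compile trades a one-time warmup for faster denoising steps
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

image = load_image("input.png")  # path to your own input image
output = pipe(image=image, prompt="Turn the sketch into a watercolor painting").images[0]
output.save("kontext_output.png")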

a-r-r-o-w posted an update about 2 months ago
New diffusion model for text-to-image and video-to-world generation: Cosmos Predict-2 👽

Model collection: nvidia/cosmos-predict2-68028efc052239369a0f2959
Diffusers support: https://github.com/huggingface/diffusers/pull/11695
Documentation: https://huggingface.co/docs/diffusers/main/en/api/pipelines/cosmos

These results are from the 2B-parameter model. Imagine what you could do with the 14B version! Go check it out now!
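
A hedged sketch of trying the 2B text-to-image checkpoint with diffusers (the model id below is my assumption; check the collection linked above for the exact repository names):

import torch
from diffusers import DiffusionPipeline

# Model id is assumed here -- see the collection above for the exact names
pipe = DiffusionPipeline.from_pretrained(
    "nvidia/Cosmos-Predict2-2B-Text2Image", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe(prompt="A robot exploring a mossy canyon at dawn").images[0]
image.save("cosmos_output.png")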

a-r-r-o-w posted an update about 2 months ago
Did you know how simple it is to get started with your own custom compiler backend for torch.compile? What's stopping you from writing your own compiler?

import torch
from torch._functorch.partitioners import draw_graph

def compiler(fx_module: torch.fx.GraphModule, _):
    # Custom torch.compile backend: dump the captured FX graph to a Graphviz
    # .dot file, then return the unmodified forward so the graph runs as captured
    draw_graph(fx_module, "compile.dot")
    return fx_module.forward

def capture(model, *inputs):
    # Compile with the custom backend and run a forward/backward pass to
    # trigger graph capture
    compiled_model = torch.compile(model, backend=compiler)
    y = compiled_model(*inputs)
    y.sum().backward()

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_1 = torch.nn.Linear(16, 32)
        self.linear_2 = torch.nn.Linear(32, 16)

    def forward(self, x):
        x = self.linear_1(x)
        x = torch.nn.functional.silu(x)
        x = self.linear_2(x)
        return x

if __name__ == '__main__':
    model = MLP()
    model.to("mps")
    x = torch.randn(4, 16, device="mps", dtype=torch.float32)

    capture(model, x)


--------------

Part of https://huggingface.co/posts/a-r-r-o-w/231008365980283

a-r-r-o-w posted an update about 2 months ago
Recently, I've been focusing my learning on the following topics:
- PyTorch internals, specifically the inductor system (roughly ~1 month of experience)
- Triton internals (~8 months of experience)
- CUDA (~3 months of experience)
- Understanding fusion patterns in compilers and how to improve them (~1 month of experience)
- Parallelism strategies for large-scale inference optimization (~6-7 months of experience)

I thought it would be nice to document these learnings somewhere. Maybe someone will find them useful? I also want to get into the habit of writing but have had no motivation to do so; maybe writing short, informal posts will help build the habit.

Since I don't have a personal site and don't plan to create one in the near future, I think HF posts are best suited for short, informal documentation of my little discoveries and learnings. If you're interested, strap in!

The first post in this series will be a basic study of PyTorch's float32 matmuls and their Triton implementation (nothing much, just the tutorial available on the website), plus a short dive into TF32 and a TFLOPS comparison on an A100 machine.
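
As a tiny preview (my own sketch, not the upcoming post itself), TF32 matmuls in PyTorch can be toggled like this, which is what makes the TFLOPS comparison interesting on Ampere GPUs such as the A100:

import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# Strict float32 matmul (slower, full mantissa precision)
torch.backends.cuda.matmul.allow_tf32 = False
c_fp32 = a @ b

# Allow TF32 tensor cores for float32 matmuls (much faster on A100,
# ~10-bit mantissa); roughly torch.set_float32_matmul_precision("high")
torch.backends.cuda.matmul.allow_tf32 = True
c_tf32 = a @ b

print("max abs difference:", (c_fp32 - c_tf32).abs().max().item())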

ChenyangSi authored 14 papers about 2 months ago