Diffusion CoT

Enterprise

non-profit

AI & ML interests

diffusion

Recent Activity

JackyZhuo updated a dataset 8 days ago

diffusion-cot/unified-test-data

JackyZhuo published a dataset 8 days ago

diffusion-cot/unified-test-data

metazlb new activity about 2 months ago

diffusion-cot/Image-Verifier: Documentation for using the model

metazlb

in diffusion-cot/Image-Verifier about 2 months ago

Documentation for using the model

#2 opened about 2 months ago by

nupurkmr9

metazlb

in diffusion-cot/FLUX-Corrector about 2 months ago

Where to put the model? How to use it?

#2 opened about 2 months ago by

ryg81

sayakpaul

posted an update about 2 months ago

Diffusers supports a wide variety of quantization backends. Navigating them can be challenging, given how complex diffusion pipelines are in general.

So, @derekl35 set out to write a comprehensive guide that puts users in the driver's seat. Explore the different backends we support, learn the trade-offs each one offers, and finally, check out the cool Space we built that lets you compare quantization results.

Give it a go here:
https://lnkd.in/gf8Pi4-2
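
For readers new to the topic, here is a minimal sketch of the kind of trade-off a quantization backend makes: fewer bytes per weight in exchange for a small rounding error. It uses plain symmetric absmax int8 quantization on a toy weight list. This is an illustrative assumption, not the Diffusers API or the scheme any particular backend uses; the function names are hypothetical.

```python
# Illustrative only: symmetric absmax int8 quantization of a toy weight list.
# This is NOT a Diffusers API; real backends use more elaborate schemes.
import random


def quantize_absmax_int8(weights):
    """Map floats to integer codes in [-127, 127] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]  # toy "layer"
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-weight error by scale / 2.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
fp16_bytes = 2 * len(weights)    # 2 bytes per fp16 weight
int8_bytes = len(codes) + 4      # 1 byte per code plus one fp32 scale
print(f"max abs error: {max_error:.6f} (bound: {scale / 2:.6f})")
print(f"memory: {fp16_bytes} B (fp16) -> {int8_bytes} B (int8)")
```

The memory saving is fixed by the bit width, while the error depends on how the weights are distributed, which is one reason different backends can give different quality on the same pipeline.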

sayakpaul

posted an update about 2 months ago

Although combining LLM and DiT architectures for text-to-image (T2I) synthesis is an emerging approach, its design space remains severely understudied.

This work was done a while ago and got into CVPR 2025 -- super excited to finally share it now, along with the data and code ♥️

We explore several architectural choices that affect this design. We provide an open & reproducible training recipe that works at scale.

Works like Playground v3 have already explored a deep fusion between an LLM and a DiT, sharing their representations through layerwise attention. They exhibit excellent performance on T2I.

Despite its compelling results and other performance virtues, deep fusion remains underexplored, which is what we want to address in our work. Specifically, we take a pre-trained LLM (Gemma-2B) and a trainable DiT, and set out to explore what makes a "good deep fusion" between the two for T2I.
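
To make "sharing representations through layerwise attention" concrete, here is a minimal single-head sketch of one fused attention layer: text tokens (from the LLM stream) and image tokens (from the DiT stream) are concatenated and attend over one joint sequence. The masking pattern (text tokens causal over text only, image tokens seeing everything) and the omission of per-stream Q/K/V projections are simplifying assumptions for illustration, not a description of the released code; all function names are hypothetical.

```python
# Illustrative sketch of joint ("deep fusion") attention over text + image tokens.
# Assumptions: single head, no learned projections, and a mask where text tokens
# are causal over text while image tokens attend to every token.
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def build_fusion_mask(num_text, num_image):
    """mask[i][j] is True when token i may attend to token j."""
    n = num_text + num_image
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_text:
                mask[i][j] = j <= i   # text: causal, never sees image tokens
            else:
                mask[i][j] = True     # image: sees all text and image tokens
    return mask


def joint_attention(queries, keys, values, mask):
    """Scaled dot-product attention over the concatenated sequence."""
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if mask[i][j] else float("-inf")
            for j, k in enumerate(keys)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(values[0]))]
        )
    return outputs


text_hidden = [[1.0, 0.0], [0.0, 1.0]]                  # toy LLM hidden states
image_hidden = [[1.0, 1.0], [0.5, -0.5], [-1.0, 0.0]]   # toy DiT hidden states
tokens = text_hidden + image_hidden
mask = build_fusion_mask(len(text_hidden), len(image_hidden))
fused = joint_attention(tokens, tokens, tokens, mask)
print(fused[0])    # first text token: attends only to itself
print(fused[-1])   # last image token: mixes text and image information
```

Under this mask the text outputs do not depend on the image tokens, so the LLM stream behaves as it would on text alone while the DiT stream reads from it at every layer.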

We explore several key questions in the work, such as:

Q1: How should we do attention? We considered several alternatives. PixArt-Alpha-like attention (cross-attention) is very promising.
Q2: Should we incorporate additional text modulation?
Q3: Can we eliminate timestep conditioning?
Q4: How do we do positional encodings?
Q5: Do instruction-tuned LLMs help deep fusion?
Q6: Would using a decoder LLM from a multimodal model be helpful?
Q7: Does using a better variant of Gemma help?

Based on these findings, we arrive at FuseDiT, which makes the following choices on top of the base architecture:

* No AdaLN-Zero modules
* 1D + 2D-RoPE
* Gemma 2 2B, adjusting DiT configurations accordingly
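
As a rough illustration of the "1D + 2D-RoPE" item, the sketch below applies rotary position embeddings in two ways: a 1D variant indexed by sequence position (as one would use for text tokens) and a 2D variant that rotates half of the channels by the row index and the other half by the column index (as one would use for image patches). Which tokens get which variant, the channel split, and the pairing of consecutive channels are assumptions made for illustration; the function names are hypothetical.

```python
# Illustrative sketch of 1D and 2D rotary position embeddings (RoPE).
# Assumptions: consecutive channel pairs are rotated together, and the 2D
# variant splits channels evenly between the row and column axes.
import math


def rope_1d(vec, position, base=10000.0):
    """Rotate each consecutive channel pair by position * base**(-i / dim)."""
    dim = len(vec)  # must be even
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return out


def rope_2d(vec, row, col, base=10000.0):
    """First half of the channels encodes the row, second half the column."""
    half = len(vec) // 2  # len(vec) must be divisible by 4
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.2, -0.4, 0.9]

# 1D: the attention score depends only on the position offset (here 2).
print(dot(rope_1d(q, 5), rope_1d(k, 3)), dot(rope_1d(q, 7), rope_1d(k, 5)))

# 2D: the score depends only on the (row, col) offset (here (1, 2)).
print(dot(rope_2d(q, 2, 3), rope_2d(k, 1, 1)),
      dot(rope_2d(q, 5, 4), rope_2d(k, 4, 2)))
```

Each printed pair should agree up to floating-point error, which is the relative-position property that makes RoPE attractive for both text sequences and image grids.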

We trained FuseDiT on a mixture from CC12M, JourneyDB, & SA (~26M image-text pairs) for 800 steps. While not the best model, it's encouraging to develop something in a guided manner using open datasets.

To learn more (the code and models are all available), please check out the paper:
https://lnkd.in/gg6qyqZX