ywlee88 committed on
Commit bb00812 · 1 Parent(s): 59dbe59

update README

Files changed (1)
  1. README.md +59 -3

README.md CHANGED
@@ -1,13 +1,69 @@
- ## Model Discription
- KOALA, which stands for **KnOwledge-distillAtion in LAtent diffusion model**, marks a notable advancement in text-to-image (T2I) synthesis technology. This model is engineered to balance speed and performance effectively, making it ideal for resource-limited environments. By emphasizing self-attention in knowledge distillation, KOALA significantly enhances the accessibility and efficiency of high-quality text-to-image synthesis, particularly in settings with constrained resources. This approach represents a major leap forward in the field of T2I technology.

- ## Model Architecture

+ <div align="center">
+ <img src="https://dl.dropboxusercontent.com/scl/fi/yosvi68jvyarbvymxc4hm/github_logo.png?rlkey=r9ouwcd7cqxjbvio43q9b3djd&dl=1" width="1024px" />
+ </div>
+
+ > **[KOALA: Self-Attention Matters in Knowledge Distillation of Latent Diffusion Models for Memory-Efficient and Fast Image Synthesis](http://arxiv.org/abs/2312.04005)**<br>
+ > [Youngwan Lee](https://github.com/youngwanLEE)<sup>1,2</sup>, [Kwanyong Park](https://pkyong95.github.io/)<sup>1</sup>, [Yoorhim Cho](https://ofzlo.github.io/)<sup>3</sup>, [Young-Ju Lee](https://scholar.google.com/citations?user=6goOQh8AAAAJ&hl=en)<sup>1</sup>, [Sung Ju Hwang](http://www.sungjuhwang.com/)<sup>2,4</sup><br>
+ > <sup>1</sup>ETRI, <sup>2</sup>KAIST, <sup>3</sup>SMWU, <sup>4</sup>DeepAuto.ai<br>
+
+ <a href="https://youngwanlee.github.io/KOALA/"><img src="https://img.shields.io/static/v1?label=Project%20Page&message=Github&color=blue&logo=github-pages"></a> &ensp;
+ <a href="https://github.com/youngwanLEE/sdxl-koala"><img src="https://img.shields.io/static/v1?label=Code&message=Github&color=blue&logo=github"></a> &ensp;
+ <a href="https://arxiv.org/abs/2312.04005"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv:KOALA&color=red&logo=arxiv"></a> &ensp;
+
  # KOALA-700M Model Card

+
+ ## Abstract
+ ### TL;DR
+ > We propose a fast text-to-image model, called KOALA, by compressing SDXL's U-Net and distilling knowledge from SDXL into our model. KOALA-700M can generate a 1024x1024 image in less than 1.5 seconds on an NVIDIA 4090 GPU, which is more than 2x faster than SDXL.
+
+ <details><summary>FULL abstract</summary>
+ Stable Diffusion is the mainstay of text-to-image (T2I) synthesis in the community due to its generation performance and open-source nature.
+ Recently, Stable Diffusion XL (SDXL), the successor of Stable Diffusion, has received a lot of attention due to its significant performance improvement, with a higher resolution of 1024x1024 and a larger model.
+ However, its increased computation cost and model size require higher-end hardware (e.g., a GPU with more VRAM) for end-users, incurring higher costs of operation.
+ To address this problem, in this work, we propose an efficient latent diffusion model for text-to-image synthesis, obtained by distilling the knowledge of SDXL.
+ To this end, we first perform an in-depth analysis of the denoising U-Net in SDXL, which is the main bottleneck of the model, and then design a more efficient U-Net based on the analysis.
+ Secondly, we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and eventually identify four essential factors, the core of which is that self-attention is the most important part.
+ With our efficient U-Net and self-attention-based knowledge distillation strategy, we build our efficient T2I models, called KOALA-1B and KOALA-700M, reducing the model size by up to 54% and 69%, respectively, relative to the original SDXL model.
+ In particular, KOALA-700M is more than twice as fast as SDXL while still retaining decent generation quality.
+ We hope that, thanks to its balanced speed-performance tradeoff, our KOALA models can serve as a cost-effective alternative to SDXL in resource-constrained environments.
+ </details>
+
+ <br>
+
+ These 1024x1024 samples are generated by KOALA-700M with 25 denoising steps.
+
+ <div align="center">
+ <img src="https://dl.dropboxusercontent.com/scl/fi/rjsqqgfney7be069y2yr7/teaser.png?rlkey=7lq0m90xpjcoqclzl4tieajpo&dl=1" width="1024px" />
+ </div>
+
+
+ ## Architecture
+ There are two types of compressed U-Net, KOALA-1B and KOALA-700M, which are realized by reducing residual blocks and transformer blocks.
+
+ <div align="center">
+ <img src="https://dl.dropboxusercontent.com/scl/fi/5ydeywgiyt1d3njw63dpk/arch.png?rlkey=1p6imbjs4lkmfpcxy153i1a2t&dl=1" width="1024px" />
+ </div>
+
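To see the compression concretely, you can load a U-Net checkpoint with 🤗 Diffusers and count its parameters. The snippet below is a minimal sketch, assuming the repository id `etri-vilab/koala-700m` and the standard Diffusers `unet` subfolder layout; neither is confirmed by this card.

```python
# Minimal sketch: compare U-Net parameter counts. Assumptions: the repo id
# "etri-vilab/koala-700m" and the standard Diffusers "unet" subfolder layout.
import torch
from diffusers import UNet2DConditionModel

def unet_billions(repo_id: str) -> float:
    """Load a U-Net from a Diffusers checkpoint and return its parameter count in billions."""
    unet = UNet2DConditionModel.from_pretrained(
        repo_id, subfolder="unet", torch_dtype=torch.float16
    )
    return sum(p.numel() for p in unet.parameters()) / 1e9

print(f"KOALA-700M U-Net: {unet_billions('etri-vilab/koala-700m'):.2f}B parameters")
print(f"SDXL-Base U-Net:  {unet_billions('stabilityai/stable-diffusion-xl-base-1.0'):.2f}B parameters")
```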
+ ## Latency and memory usage comparison on different GPUs
+
+ We measure the inference time of SDM-v2.0 at 768x768 resolution and of the other models at 1024x1024, using a variety of consumer-grade GPUs: NVIDIA 3060Ti (8GB), 2080Ti (11GB), and 4090 (24GB). We use 25 denoising steps and FP16/FP32 precision. OOM means Out-of-Memory. Note that SDXL-Base cannot run on the 8GB GPU.
+
+ <div align="center">
+ <img src="https://dl.dropboxusercontent.com/scl/fi/u1az20y0zfww1l5lhbcyd/latency_gpu.svg?rlkey=vjn3gpkmywmp7jpilar4km7sd&dl=1" width="1024px" />
+ </div>
+
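A measurement in this spirit can be approximated with CUDA event timers. The snippet below is a hedged sketch of the methodology described above (FP16, 25 denoising steps), not the authors' benchmark script; the repository id `etri-vilab/koala-700m` is an assumption.

```python
# Hedged sketch of the timing methodology above (FP16, 25 denoising steps),
# not the authors' benchmark script. The repo id is an assumption.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "etri-vilab/koala-700m", torch_dtype=torch.float16
).to("cuda")

prompt = "a koala surfing at sunset, detailed, 4k"
pipe(prompt, num_inference_steps=2)  # warm-up pass to exclude one-time CUDA setup

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
pipe(prompt, num_inference_steps=25)
end.record()
torch.cuda.synchronize()  # wait for all GPU work to finish before reading timers
print(f"latency: {start.elapsed_time(end) / 1000:.2f} s")  # elapsed_time returns ms
```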
 
  ## Key Features
  - **Efficient U-Net Architecture**: KOALA models use a simplified U-Net architecture that reduces the model size by up to 54% (KOALA-1B) and 69% (KOALA-700M) compared to their predecessor, Stable Diffusion XL (SDXL).
  - **Self-Attention-Based Knowledge Distillation**: The core technique in KOALA focuses on the distillation of self-attention features, which proves crucial for maintaining image generation quality; a schematic sketch of the idea follows this list.
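The released checkpoints already incorporate this distillation, so no extra code is needed to use them. Purely as an illustration of the idea, the sketch below records self-attention outputs (the `attn1` modules in Diffusers U-Nets) from a teacher and a student U-Net with forward hooks and penalizes their L2 distance. This is not the authors' training code: pairing blocks by module name is a simplification (the student has fewer blocks than the teacher), and the repository ids are assumptions.

```python
# Schematic illustration only (NOT the authors' training code): distill
# self-attention features from a teacher U-Net into a compressed student
# U-Net by capturing "attn1" (self-attention) outputs with forward hooks
# and penalizing their L2 distance. Pairing blocks by name is a
# simplification; repo ids are assumptions.
import torch.nn.functional as F
from diffusers import UNet2DConditionModel

teacher_unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
student_unet = UNet2DConditionModel.from_pretrained(
    "etri-vilab/koala-700m", subfolder="unet"
)

def record_self_attention(unet, store: dict):
    """Register hooks that save the output of every self-attention ("attn1") module."""
    for name, module in unet.named_modules():
        if name.endswith("attn1"):
            module.register_forward_hook(
                lambda _mod, _inp, out, key=name: store.__setitem__(key, out)
            )

teacher_feats, student_feats = {}, {}
record_self_attention(teacher_unet, teacher_feats)
record_self_attention(student_unet, student_feats)

# ... run both U-Nets on the same noisy latents and conditioning, then:
kd_loss = sum(
    F.mse_loss(student_feats[name], teacher_feats[name].detach())
    for name in student_feats
    if name in teacher_feats and student_feats[name].shape == teacher_feats[name].shape
)
```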
 
  ## Usage with 🤗[Diffusers library](https://github.com/huggingface/diffusers)
  The inference code with 25 denoising steps:
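A minimal sketch of such a call, assuming the repository id `etri-vilab/koala-700m` and that the checkpoint loads through Diffusers' standard SDXL pipeline class:

```python
# Minimal sketch of KOALA-700M inference with 25 denoising steps.
# Assumptions: repo id "etri-vilab/koala-700m"; the checkpoint loads via the
# standard SDXL pipeline class in 🤗 Diffusers.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "etri-vilab/koala-700m", torch_dtype=torch.float16
).to("cuda")

prompt = "A portrait photo of a koala bear wearing sunglasses, studio lighting"
image = pipe(prompt, num_inference_steps=25, guidance_scale=7.5).images[0]
image.save("koala_700m_sample.png")
```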