Update README.md
README.md
CHANGED
@@ -8,6 +8,8 @@ LocalDiT builds upon the architecture of [PixArt-α](https://huggingface.co/PixA
 - **Parameters**: 0.52B
 - **Resolution**: Supports generation up to 1024×1024 pixels
 - **Language Support**: English text prompts
+- **Text Encoder**: FLAN-T5-XXL (4.3B parameters)
+- **VAE**: SDXL VAE for high-quality latent encoding/decoding
 
 # Usage
 Details on code execution will be released at a later date.
@@ -33,6 +35,10 @@ image.save("generated_image.png")
 - Implemented window-based local attention in alternating transformer blocks
 - Reduced parameter count through efficient attention design
 - Optimized for both quality and computational efficiency
+- **Components**:
+- Diffusion Backbone: Custom LocalDiT architecture (0.52B parameters)
+- Text Encoder: FLAN-T5-XXL (4.3B parameters) for rich text embedding
+- VAE: SDXL's Variational Autoencoder for high-fidelity latent space encoding/decoding
 
 # Results
 LocalDiT achieves comparable image quality to PixArt-α while offering: