# LocalDiT
LocalDiT is a lightweight Diffusion Transformer (DiT) model for high-quality text-to-image generation. It incorporates local attention mechanisms to improve computational efficiency while maintaining generation quality.

# Model Description
LocalDiT builds upon the architecture of PixArt-α, introducing local attention mechanisms to reduce computational complexity and memory requirements. By attending over image patches within local windows rather than with global attention, the model achieves faster inference and training while preserving image generation quality; a minimal sketch of this windowed attention follows the list below.

- **Type**: Diffusion Transformer (DiT) with Local Attention
- **Parameters**: 0.52B
- **Resolution**: Supports generation up to 1024×1024 pixels
- **Language Support**: English text prompts
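
To make the windowed attention concrete, here is a minimal, self-contained PyTorch sketch of window-based self-attention over a grid of patch tokens. It is illustrative only: the window size, module structure, and all names are assumptions, not the released LocalDiT implementation.

```python
# Minimal sketch of window-based local self-attention (illustrative,
# NOT the released LocalDiT code; window size is an assumption).
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, window: int):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, dim) grid of patch embeddings
        b, h, w, d = x.shape
        s = self.window
        assert h % s == 0 and w % s == 0, "grid must divide evenly into windows"
        # Partition the grid into non-overlapping s×s windows.
        x = x.view(b, h // s, s, w // s, s, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, s * s, d)
        # Each window attends only to its own s*s tokens, so the cost is
        # O(n_windows * (s*s)^2) rather than O((h*w)^2) for global attention.
        x, _ = self.attn(x, x, x, need_weights=False)
        # Undo the window partition to recover the (batch, h, w, dim) grid.
        x = x.view(b, h // s, w // s, s, s, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, d)
        return x

# Example: a 32×32 patch grid with 8×8 windows.
tokens = torch.randn(1, 32, 32, 256)
out = WindowSelfAttention(dim=256, num_heads=4, window=8)(tokens)
print(out.shape)  # torch.Size([1, 32, 32, 256])
```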

# Usage
Full usage details will be released at a later date; the snippet below illustrates the intended inference API.
```python
import torch

from model import LocalDiTPipeline

# Load the pretrained pipeline in half precision and move it to the GPU.
pipe = LocalDiTPipeline.from_pretrained("datagrid/LocalDiT-1024", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A cute cat sitting on a windowsill, digital art"
negative_prompt = "low quality, distorted, blurry"

# Run 50 denoising steps and save the first generated image.
image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0]
image.save("generated_image.png")
```

# Training Details

- **Training Data**: Approximately 40M image-text pairs
- **Training Strategy**: Multi-stage resolution training (256px → 512px → 1024px); a sketch of such a schedule follows the list
- **Architecture Modifications**:
  - Implemented window-based local attention in alternating transformer blocks
  - Reduced parameter count through efficient attention design
  - Optimized for both quality and computational efficiency
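
The multi-stage strategy can be pictured as training one model on progressively larger images. The loop below is a hypothetical sketch of that schedule; the step counts and the `make_batches`/`train_step` helpers are placeholders, not released training code.

```python
# Hypothetical multi-stage resolution training loop (illustrative only).

def make_batches(resolution: int, steps: int):
    """Placeholder loader yielding `steps` dummy batches at `resolution`."""
    for _ in range(steps):
        yield {"resolution": resolution}

def train_step(batch):
    """Placeholder for one diffusion training step (forward, loss, backward)."""
    pass

# 256px -> 512px -> 1024px, mirroring the strategy described above.
# The step counts here are made-up placeholders.
for resolution, steps in [(256, 3), (512, 2), (1024, 1)]:
    for batch in make_batches(resolution, steps):
        train_step(batch)
```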

# Results
LocalDiT achieves comparable image quality to PixArt-α while offering:
- 20% reduction in model parameters
- Up to 30% faster inference speed
- Reduced memory footprint (a back-of-the-envelope illustration follows)
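
As a rough illustration of where the savings come from, under assumed settings: a 1024×1024 image tokenized into a 64×64 patch grid (the grid size and the 8×8 window are assumptions, not published details) gives global self-attention 4096² score entries per layer, while 8×8 windows give only 64 × 64² entries in the local blocks, a 64× reduction.

```python
# Back-of-the-envelope attention-pair counts for a 64x64 patch grid.
# The grid size and the 8x8 window are illustrative assumptions.
n_tokens = 64 * 64                      # 4096 tokens under global attention
global_pairs = n_tokens ** 2            # 16,777,216 score entries per layer
n_windows = (64 // 8) * (64 // 8)       # 64 non-overlapping 8x8 windows
local_pairs = n_windows * (8 * 8) ** 2  # 262,144 score entries per layer
print(global_pairs // local_pairs)      # -> 64: far fewer pairs in local blocks
```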

# License
This model is released under the Apache 2.0 License.

# Limitations
- The model primarily works with English text prompts.
- Like other text-to-image models, it may struggle with complex spatial relationships, text rendering, and accurate human anatomy.
- The model may inherit biases present in the training data.

# Citation
Citation information will be provided at a later date.