Add pipeline tag, license, and link to Github repository

This PR adds the `pipeline_tag` as `text-to-image` to the model card, adds license information, and links to the Github repository.
README.md (CHANGED)

---
pipeline_tag: text-to-image
license: apache-2.0
tags:
- Any2Any
---

**Lumina-mGPT** is a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.

[](https://arxiv.org/abs/2408.02657)



# Usage

We provide the implementation of Lumina-mGPT, as well as sampling code, in our [github repository](https://github.com/Alpha-VLLM/Lumina-mGPT).

<div align="center">

<img src="assets/logo.png" width="30%"/>

# Lumina-mGPT

<b> A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. 👋 join our <a href="http://imagebind-llm.opengvlab.com/qrcode/" target="_blank">WeChat</a> </b>

[](https://arxiv.org/abs/2408.02657)&#160;

[-6B88E3?logo=youtubegaming&label=Demo%20Lumina-mGPT)](http://106.14.2.150:10020/)&#160;
[-6B88E3?logo=youtubegaming&label=Demo%20Lumina-mGPT)](http://106.14.2.150:10021/)&#160;

</div>

<img src="assets/demos.png">

## 📰 News

- **[2024-08-11] 🎉🎉🎉 [Training codes and documents](./lumina_mgpt/TRAIN.md) are released! 🎉🎉🎉**

- **[2024-07-08] 🎉🎉🎉 Lumina-mGPT is released! 🎉🎉🎉**

## ⚙️ Installation

See [INSTALL.md](./INSTALL.md) for detailed instructions.

Note that the Lumina-mGPT implementation heavily relies on the [xllmx](./xllmx) module, which evolved from [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory) to support LLM-centered multimodal tasks. Make sure it is installed correctly as a Python package before proceeding.
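
As a quick sanity check after installation, the following minimal sketch only assumes that `xllmx` is importable; it does not rely on any particular API inside the package:

```python
# Minimal check that the xllmx dependency is importable after installation.
try:
    import xllmx  # noqa: F401
    print("xllmx is installed and importable.")
except ImportError as err:
    print(f"xllmx is not installed correctly: {err}")
```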

## ⛽ Training

See [lumina_mgpt/TRAIN.md](lumina_mgpt/TRAIN.md)

## 📽️ Inference

> [!Note]
>
> Before using the Lumina-mGPT model, run
>
> ```bash
> # bash
> cd lumina_mgpt
> ```
>
> to enter the directory of the Lumina-mGPT implementation.

### Preparation

Since the Chameleon implementation in transformers does not currently contain the VQ-VAE decoder, please manually download the original VQ-VAE weights [provided by Meta](https://github.com/facebookresearch/chameleon) and put them in the following directory:

```
Lumina-mGPT
- lumina_mgpt/
    - ckpts/
        - chameleon/
            - tokenizer/
                - text_tokenizer.json
                - vqgan.yaml
                - vqgan.ckpt
- xllmx/
- ...
```
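
To confirm the files ended up in the expected place, here is a small sketch; the paths simply mirror the layout above and assume you run it from the repository root:

```python
from pathlib import Path

# Expected location of the Chameleon tokenizer / VQ-VAE files (mirrors the layout above).
ckpt_dir = Path("lumina_mgpt/ckpts/chameleon/tokenizer")
expected = ["text_tokenizer.json", "vqgan.yaml", "vqgan.ckpt"]

missing = [name for name in expected if not (ckpt_dir / name).exists()]
if missing:
    print(f"Missing files under {ckpt_dir}: {missing}")
else:
    print("All Chameleon tokenizer / VQ-VAE files are in place.")
```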

### Local Gradio Demos

We have prepared three different Gradio demos, each showcasing unique functionalities, to help you quickly become familiar with the capabilities of the Lumina-mGPT models.

#### 1. [demos/demo_image_generation.py](./Lumina-mGPT/demos/demo_image_generation.py)

This demo is customized for Image Generation tasks, where you can input a text description and generate a corresponding image.
To host this demo, run:

```bash
# Note: set the `--target_size` argument consistent with the checkpoint
python -u demos/demo_image_generation.py \
--pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768 \
--target_size 768
```

#### 2. [demos/demo_image2image.py](./Lumina-mGPT/demos/demo_image2image.py)

This demo is designed for models trained with Omni-SFT. You can conveniently switch between the multiple downstream tasks using this demo.

```bash
# Note: set the `--target_size` argument consistent with the checkpoint
python -u demos/demo_image2image.py \
--pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni \
--target_size 768
```

#### 3. [demos/demo_freeform.py](./Lumina-mGPT/demos/demo_freeform.py)

This is a powerful demo with minimal constraint on the input format. It supports flexible interaction and is suitable for in-depth exploration.

```bash
# Note: set the `--target_size` argument consistent with the checkpoint
python -u demos/demo_freeform.py \
--pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni \
--target_size 768
```

### Simple Inference

The simplest code for Lumina-mGPT inference:

```python
from inference_solver import FlexARInferenceSolver
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768",
    precision="bf16",
    target_size=768,
)

q1 = f"Generate an image of 768x768 according to the following prompt:\n" \
     f"Image of a dog playing water, and a waterfall is in the background."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate(
    images=[],
    qas=[[q1, None]],
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1, new_image = generated[0], generated[1][0]


# ******************* Image Understanding ******************
inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-512",
    precision="bf16",
    target_size=512,
)

# The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the LLM
q1 = "Describe the image in detail. <|image|>"

images = [Image.open("image.png")]
qas = [[q1, None]]

# `len(images)` should equal the number of appearances of "<|image|>" in qas
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)

a1 = generated[0]
# generated[1], namely the list of newly generated images, should typically be empty in this case.


# ********************* Omni-Potent *********************
inference_solver = FlexARInferenceSolver(
    model_path="Alpha-VLLM/Lumina-mGPT-7B-768-Omni",
    precision="bf16",
    target_size=768,
)

# Example: Depth Estimation
# For more instructions, see demos/demo_image2image.py
q1 = "Depth estimation. <|image|>"
images = [Image.open("image.png")]
qas = [[q1, None]]

generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=8192,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=1.0, image_top_k=200),
)

a1 = generated[0]
new_image = generated[1][0]
```
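
As a follow-up to the image generation example, here is a minimal sketch for inspecting and persisting the outputs; it assumes `a1` and `new_image` come from the block above and that `new_image` is a `PIL.Image`:

```python
# Continuation of the image generation example above (assumes a1 and new_image exist).
print(a1)                        # the model's text response
print(new_image.size)            # image resolution, e.g. (768, 768)
new_image.save("generated.png")  # write the generated image to disk
```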

## 🤗 Checkpoints

**Configurations**

<img src="assets/config2.jpg">
<img src="assets/config1.jpg">

**7B models**

| Model        | Size | Huggingface                                                                                      |
| ------------ | ---- | ------------------------------------------------------------------------------------------------ |
| FP-SFT@512   | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-512](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-512)             |
| FP-SFT@768   | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-768](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-768)             |
| Omni-SFT@768 | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-768-Omni](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-768-Omni)   |
| FP-SFT@1024  | 7B   | [Alpha-VLLM/Lumina-mGPT-7B-1024](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-1024)           |

**34B models**

| Model      | Size | Huggingface                                                                              |
| ---------- | ---- | ---------------------------------------------------------------------------------------- |
| FP-SFT@512 | 34B  | [Alpha-VLLM/Lumina-mGPT-34B-512](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-34B-512)   |

More checkpoints coming soon.
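
To pre-fetch any of these checkpoints into the local Hugging Face cache (useful for offline runs), a minimal sketch using `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Download the FP-SFT@768 checkpoint into the local Hugging Face cache.
path = snapshot_download(repo_id="Alpha-VLLM/Lumina-mGPT-7B-768")
print(f"Checkpoint cached at: {path}")
```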

## 📑 Open-source Plan

- [X] Inference code
- [X] Training code

## 🔥 Open positions

We are hiring interns, postdocs, and full-time researchers at the General Vision Group, Shanghai AI Lab, with a focus on multi-modality and vision foundation models. If you are interested, please contact [email protected].

## 📄 Citation

```
@misc{liu2024lumina-mgpt,
      title={Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining},
      author={Dongyang Liu and Shitian Zhao and Le Zhuo and Weifeng Lin and Yu Qiao and Hongsheng Li and Peng Gao},
      year={2024},
      eprint={2408.02657},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.02657},
}
```