nielsr (HF Staff) committed
Commit f21852f · verified · 1 parent: 755e7e4

Add pipeline tag, license, and link to Github repository


This PR adds the `pipeline_tag` as `text-to-image` to the model card, adds license information, and links to the GitHub repository.
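
For reference, once this change is applied, the new metadata can be read back from the Hub with `huggingface_hub` (a minimal sketch; the repo id below is one of the Lumina-mGPT checkpoints listed in the README and is used here only as an example):

```python
from huggingface_hub import ModelCard

# Example repo id; substitute the repository this model card belongs to.
card = ModelCard.load("Alpha-VLLM/Lumina-mGPT-7B-768")
print(card.data.pipeline_tag)  # expected: "text-to-image" once this change is applied
print(card.data.license)       # expected: "apache-2.0"
```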

Files changed (1)
  1. README.md +235 -3
README.md CHANGED
@@ -1,10 +1,10 @@
  ---
- pipeline_tag: any-to-any
  tags:
  - Any2Any
  ---

-
  **Lumina-mGPT** is a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.

  [![Lumina-mGPT](https://img.shields.io/badge/Paper-Lumina--mGPT-2b9348.svg?logo=arXiv)](https://arxiv.org/abs/2408.02657)
@@ -12,4 +12,236 @@ tags:
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6358a167f56b03ec9147074d/hgaCZdtmdlCDcZ8tb4Rme.png)

  # Usage
- We provide the implementation of Lumina-mGPT, as well as sampling code, in our [github repository](https://github.com/Alpha-VLLM/Lumina-mGPT).
  ---
+ pipeline_tag: text-to-image
+ license: apache-2.0
  tags:
  - Any2Any
  ---

  **Lumina-mGPT** is a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions.

  [![Lumina-mGPT](https://img.shields.io/badge/Paper-Lumina--mGPT-2b9348.svg?logo=arXiv)](https://arxiv.org/abs/2408.02657)

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6358a167f56b03ec9147074d/hgaCZdtmdlCDcZ8tb4Rme.png)

  # Usage
+ We provide the implementation of Lumina-mGPT, as well as sampling code, in our [GitHub repository](https://github.com/Alpha-VLLM/Lumina-mGPT).
+
+ <div align="center">
+
+ <img src="assets/logo.png" width="30%"/>
+
+ # Lumina-mGPT
+
+ <b> A family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. 👋 Join our <a href="http://imagebind-llm.opengvlab.com/qrcode/" target="_blank">WeChat</a> </b>
+
+ [![Lumina-mGPT](https://img.shields.io/badge/Paper-Lumina--mGPT-2b9348.svg?logo=arXiv)](https://arxiv.org/abs/2408.02657)&#160;
+
+ [![Static Badge](https://img.shields.io/badge/Official(node1)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-mGPT)](http://106.14.2.150:10020/)&#160;
+ [![Static Badge](https://img.shields.io/badge/Official(node2)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-mGPT)](http://106.14.2.150:10021/)&#160;
+
+ </div>
+
+ <img src="assets/demos.png">
+
+ ## 📰 News
+
+ - **[2024-08-11] 🎉🎉🎉 [Training code and documents](./lumina_mgpt/TRAIN.md) are released! 🎉🎉🎉**
+
+ - **[2024-07-08] 🎉🎉🎉 Lumina-mGPT is released! 🎉🎉🎉**
+
+ ## ⚙️ Installation
+
+ See [INSTALL.md](./INSTALL.md) for detailed instructions.
+
+ Note that the Lumina-mGPT implementation relies heavily on the [xllmx](./xllmx) module, which evolved from [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory) to support LLM-centered multimodal tasks. Make sure it is installed correctly as a Python package before proceeding; a quick check is sketched below.
+
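As a quick sanity check (an editorial sketch, not part of the official instructions; it assumes the package is importable under the name `xllmx`), you can verify the installation before moving on:

```python
# Optional check that xllmx is installed as a Python package.
# The import name `xllmx` is assumed from the repository layout; see INSTALL.md if it is missing.
import importlib.util

if importlib.util.find_spec("xllmx") is None:
    raise RuntimeError("xllmx not found; install it as described in INSTALL.md before continuing.")
print("xllmx is installed.")
```
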
+ ## ⛽ Training
+ See [lumina_mgpt/TRAIN.md](lumina_mgpt/TRAIN.md).
+
+ ## 📽️ Inference
+
+ > [!NOTE]
+ >
+ > Before using the Lumina-mGPT model, run
+ >
+ > ```bash
+ > # bash
+ > cd lumina_mgpt
+ > ```
+ >
+ > to enter the directory of the Lumina-mGPT implementation.
+
+ ### Preparation
+
+ Since the Chameleon implementation in transformers does not currently include the VQ-VAE decoder, please manually download the original VQ-VAE weights [provided by Meta](https://github.com/facebookresearch/chameleon) and put them in the following directory:
+
+ ```
+ Lumina-mGPT
+ - lumina_mgpt/
+     - ckpts/
+         - chameleon/
+             - tokenizer/
+                 - text_tokenizer.json
+                 - vqgan.yaml
+                 - vqgan.ckpt
+ - xllmx/
+ - ...
+ ```
+
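To catch path mistakes early, here is a small optional check (an editorial sketch, not part of the official instructions) that the expected tokenizer and VQ-VAE files are in place; it assumes you run it from the repository root shown above:

```python
from pathlib import Path

# Expected Chameleon tokenizer / VQ-VAE files, relative to the repository root (see the tree above).
expected = [
    "lumina_mgpt/ckpts/chameleon/tokenizer/text_tokenizer.json",
    "lumina_mgpt/ckpts/chameleon/tokenizer/vqgan.yaml",
    "lumina_mgpt/ckpts/chameleon/tokenizer/vqgan.ckpt",
]

missing = [p for p in expected if not Path(p).is_file()]
if missing:
    raise FileNotFoundError(f"Missing VQ-VAE / tokenizer files: {missing}")
print("All Chameleon tokenizer and VQ-VAE files are in place.")
```
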
+ ### Local Gradio Demos
+
+ We have prepared three different Gradio demos, each showcasing unique functionalities, to help you quickly become familiar with the capabilities of the Lumina-mGPT models.
+
+ #### 1. [demos/demo_image_generation.py](./Lumina-mGPT/demos/demo_image_generation.py)
+
+ This demo is customized for image generation tasks, where you can input a text description and generate a corresponding image.
+ To host this demo, run:
+
+ ```bash
+ # Note: set the `--target_size` argument to match the checkpoint
+ python -u demos/demo_image_generation.py \
+     --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768 \
+     --target_size 768
+ ```
+
+ #### 2. [demos/demo_image2image.py](./Lumina-mGPT/demos/demo_image2image.py)
+
+ This demo is designed for models trained with Omni-SFT. You can conveniently switch between the multiple downstream tasks using this demo.
+
+ ```bash
+ # Note: set the `--target_size` argument to match the checkpoint
+ python -u demos/demo_image2image.py \
+     --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni \
+     --target_size 768
+ ```
+
+ #### 3. [demos/demo_freeform.py](./Lumina-mGPT/demos/demo_freeform.py)
+
+ This is a powerful demo with minimal constraints on the input format. It supports flexible interaction and is suitable for in-depth exploration.
+
+ ```bash
+ # Note: set the `--target_size` argument to match the checkpoint
+ python -u demos/demo_freeform.py \
+     --pretrained_path Alpha-VLLM/Lumina-mGPT-7B-768-Omni \
+     --target_size 768
+ ```
+
+ ### Simple Inference
+
+ The simplest code for Lumina-mGPT inference:
+
+ ```python
+ from inference_solver import FlexARInferenceSolver
+ from PIL import Image
+
+ # ******************** Image Generation ********************
+ inference_solver = FlexARInferenceSolver(
+     model_path="Alpha-VLLM/Lumina-mGPT-7B-768",
+     precision="bf16",
+     target_size=768,
+ )
+
+ q1 = ("Generate an image of 768x768 according to the following prompt:\n"
+       "Image of a dog playing in water, and a waterfall is in the background.")
+
+ # generated: tuple of (generated response, list of generated images)
+ generated = inference_solver.generate(
+     images=[],
+     qas=[[q1, None]],
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ a1, new_image = generated[0], generated[1][0]
+
+
+ # ******************* Image Understanding ******************
+ inference_solver = FlexARInferenceSolver(
+     model_path="Alpha-VLLM/Lumina-mGPT-7B-512",
+     precision="bf16",
+     target_size=512,
+ )
+
+ # The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the LLM
+ q1 = "Describe the image in detail. <|image|>"
+
+ images = [Image.open("image.png")]
+ qas = [[q1, None]]
+
+ # `len(images)` should equal the number of occurrences of "<|image|>" in qas
+ generated = inference_solver.generate(
+     images=images,
+     qas=qas,
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
+ )
+
+ a1 = generated[0]
+ # generated[1], namely the list of newly generated images, should typically be empty in this case.
+
+
+ # ********************* Omni-Potent *********************
+ inference_solver = FlexARInferenceSolver(
+     model_path="Alpha-VLLM/Lumina-mGPT-7B-768-Omni",
+     precision="bf16",
+     target_size=768,
+ )
+
+ # Example: Depth Estimation
+ # For more instructions, see demos/demo_image2image.py
+ q1 = "Depth estimation. <|image|>"
+ images = [Image.open("image.png")]
+ qas = [[q1, None]]
+
+ generated = inference_solver.generate(
+     images=images,
+     qas=qas,
+     max_gen_len=8192,
+     temperature=1.0,
+     logits_processor=inference_solver.create_logits_processor(cfg=1.0, image_top_k=200),
+ )
+
+ a1 = generated[0]
+ new_image = generated[1][0]
+
+ ```
+
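As a small follow-up (an editorial sketch, not from the original README), the outputs of the image-generation example can be inspected and saved with standard PIL calls, assuming the returned images are `PIL.Image` objects as the import above suggests:

```python
# Continuing from the Image Generation example above:
print(a1)                              # the text response accompanying the generated image
print(new_image.size)                  # e.g. (768, 768) for the 768 checkpoint
new_image.save("generated_image.png")  # save the PIL.Image to disk
```
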
+ ## 🤗 Checkpoints
+
+ **Configurations**
+
+ <img src="assets/config2.jpg">
+ <img src="assets/config1.jpg">
+
+ **7B models**
+
+ | Model | Size | Huggingface |
+ | ------------ | ---- | ---------------------------------------------------------------------------------------- |
+ | FP-SFT@512 | 7B | [Alpha-VLLM/Lumina-mGPT-7B-512](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-512) |
+ | FP-SFT@768 | 7B | [Alpha-VLLM/Lumina-mGPT-7B-768](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-768) |
+ | Omni-SFT@768 | 7B | [Alpha-VLLM/Lumina-mGPT-7B-768-Omni](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-768-Omni) |
+ | FP-SFT@1024 | 7B | [Alpha-VLLM/Lumina-mGPT-7B-1024](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-7B-1024) |
+
+ **34B models**
+
+ | Model | Size | Huggingface |
+ | ---------- | ---- | ------------------------------------------------------------------------------------ |
+ | FP-SFT@512 | 34B | [Alpha-VLLM/Lumina-mGPT-34B-512](https://huggingface.co/Alpha-VLLM/Lumina-mGPT-34B-512) |
+
+ More checkpoints coming soon.
+
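The `@512`, `@768`, and `@1024` suffixes indicate the image resolution each checkpoint targets, which is why the demo commands above set `--target_size` to match. A small illustrative helper (an editorial sketch; the mapping is simply read off the tables above and is not an official API):

```python
# Map each released checkpoint to the target size it expects (taken from the tables above).
CHECKPOINT_TARGET_SIZE = {
    "Alpha-VLLM/Lumina-mGPT-7B-512": 512,
    "Alpha-VLLM/Lumina-mGPT-7B-768": 768,
    "Alpha-VLLM/Lumina-mGPT-7B-768-Omni": 768,
    "Alpha-VLLM/Lumina-mGPT-7B-1024": 1024,
    "Alpha-VLLM/Lumina-mGPT-34B-512": 512,
}

def target_size_for(model_path: str) -> int:
    """Return the --target_size / target_size value matching a checkpoint."""
    return CHECKPOINT_TARGET_SIZE[model_path]
```
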
+ ## 📑 Open-source Plan
+
+ - [X] Inference code
+ - [X] Training code
+
+ ## 🔥 Open positions
+ We are hiring interns, postdocs, and full-time researchers at the General Vision Group, Shanghai AI Lab, with a focus on multi-modality and vision foundation models. If you are interested, please contact [email protected].
+
+ ## 📄 Citation
+
+ ```
+ @misc{liu2024lumina-mgpt,
+       title={Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining},
+       author={Dongyang Liu and Shitian Zhao and Le Zhuo and Weifeng Lin and Yu Qiao and Hongsheng Li and Peng Gao},
+       year={2024},
+       eprint={2408.02657},
+       archivePrefix={arXiv},
+       primaryClass={cs.CV},
+       url={https://arxiv.org/abs/2408.02657},
+ }
+ ```