Commit 2219ee4 by luodian (parent: ad899ef): Update README.md
Files changed: README.md (+429, -3)

The previous README.md contained only the YAML front matter declaring `license: apache-2.0`; this commit replaces it with the full model card below.

---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
language:
- en
- zh
metrics:
- accuracy
library_name: transformers
tags:
- multimodal

model-index:
- name: llava-onevision-qwen-7b-ov
  results:
  - task:
      type: multimodal
    dataset:
      type: ai2d
      name: AI2D
    metrics:
    - name: accuracy
      type: accuracy
      value: 81.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: chartqa
      name: ChartQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 80.0
      verified: true
  - task:
      type: multimodal
    dataset:
      type: docvqa
      name: DocVQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 90.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: infovqa
      name: InfoVQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 70.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mathverse
      name: MathVerse
    metrics:
    - name: accuracy
      type: accuracy
      value: 26.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mathvista
      name: MathVista
    metrics:
    - name: accuracy
      type: accuracy
      value: 63.2
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmbench
      name: MMBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 80.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mme-perception
      name: MME-Perception
    metrics:
    - name: score
      type: score
      value: 1580
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mme-cognition
      name: MME-Cognition
    metrics:
    - name: score
      type: score
      value: 418
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmmu
      name: MMMU
    metrics:
    - name: accuracy
      type: accuracy
      value: 48.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmvet
      name: MMVet
    metrics:
    - name: accuracy
      type: accuracy
      value: 57.5
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmstar
      name: MMStar
    metrics:
    - name: accuracy
      type: accuracy
      value: 61.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: seed-bench
      name: Seed-Bench
    metrics:
    - name: accuracy
      type: accuracy
      value: 75.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: science-qa
      name: Science-QA
    metrics:
    - name: accuracy
      type: accuracy
      value: 96.0
      verified: true
  - task:
      type: multimodal
    dataset:
      type: imagedc
      name: ImageDC
    metrics:
    - name: accuracy
      type: accuracy
      value: 88.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mmlbench
      name: MMLBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 77.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: realworldqa
      name: RealWorldQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 66.3
      verified: true
  - task:
      type: multimodal
    dataset:
      type: vibe-eval
      name: Vibe-Eval
    metrics:
    - name: accuracy
      type: accuracy
      value: 51.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: llava-w
      name: LLaVA-W
    metrics:
    - name: accuracy
      type: accuracy
      value: 90.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: l-wilder
      name: LLaVA-Wilder
    metrics:
    - name: accuracy
      type: accuracy
      value: 67.8
      verified: true
  - task:
      type: multimodal
    dataset:
      type: actnet-qa
      name: ActNet-QA
    metrics:
    - name: accuracy
      type: accuracy
      value: 56.6
      verified: true
  - task:
      type: multimodal
    dataset:
      type: egoschema
      name: EgoSchema
    metrics:
    - name: accuracy
      type: accuracy
      value: 60.1
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mlvu
      name: MLVU
    metrics:
    - name: accuracy
      type: accuracy
      value: 64.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: mvbench
      name: MVBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 56.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: nextqa
      name: NextQA
    metrics:
    - name: accuracy
      type: accuracy
      value: 79.4
      verified: true
  - task:
      type: multimodal
    dataset:
      type: percepTest
      name: PercepTest
    metrics:
    - name: accuracy
      type: accuracy
      value: 49.7
      verified: true
  - task:
      type: multimodal
    dataset:
      type: seedbench
      name: SeedBench
    metrics:
    - name: accuracy
      type: accuracy
      value: 56.9
      verified: true
  - task:
      type: multimodal
    dataset:
      type: videochatgpt
      name: VideoChatGPT
    metrics:
    - name: score
      type: score
      value: 3.49
      verified: true
  - task:
      type: multimodal
    dataset:
      type: videodc
      name: VideoDC
    metrics:
    - name: score
      type: score
      value: 3.75
      verified: true
  - task:
      type: multimodal
    dataset:
      type: videomme
      name: VideoMME
    metrics:
    - name: accuracy
      type: accuracy
      value: 58.2
      verified: true
---

# LLaVA-OneVision

![banner](https://i.postimg.cc/pL17YtG4/WX20240508-220230-2x.png)

Play with the model on the [LLaVA OneVision Chat](https://llava-onevision.lmms-lab.com/).

## Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Limitations](#limitations)
4. [Training](#training)
5. [License](#license)
6. [Citation](#citation)

## Model Summary

The LLaVA-OneVision models are 0.5B-, 7B-, and 72B-parameter models trained on the [LLaVA-OneVision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) dataset, based on the Qwen2 language model with a context window of 32K tokens.

- **Repository:** [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT?tab=readme-ov-file)
- **Project Website:** [llava-onevision.lmms-lab.com](https://llava-onevision.lmms-lab.com)
- **Paper:** [LLaVA-OneVision]()
- **Point of Contact:** [Bo Li](mailto:[email protected])
- **Languages:** English, Chinese

## Use

### Intended use

The model was trained on the [LLaVA-OneVision Dataset](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) and can interact with images, multi-image inputs, and videos.

**Feel free to share your generations in the Community tab!**

### Generation
```python
# pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle

from PIL import Image
import requests
import copy
import torch
import warnings

warnings.filterwarnings("ignore")

pretrained = "lmms-lab/llava-onevision-qwen2-7b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other arguments you want to pass via llava_model_args

model.eval()

# Load an example image and preprocess it into the tensor format the model expects
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [_image.to(dtype=torch.float16, device=device) for _image in image_tensor]

# Build the prompt with the image placeholder token using the Qwen chat template
conv_template = "qwen_1_5"  # Make sure to use the correct chat template for different models
question = DEFAULT_IMAGE_TOKEN + "\nWhat is shown in this image?"
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
image_sizes = [image.size]

# Greedy decoding
cont = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)
print(text_outputs)
```
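
The model also accepts multi-image prompts. The sketch below is a minimal, illustrative extension of the example above (not a separate official recipe): it reuses the already-loaded `model`, `tokenizer`, and `image_processor`, adds one `DEFAULT_IMAGE_TOKEN` per image, and passes all processed images to `generate`. The file paths are placeholders.

```python
# Minimal multi-image sketch, assuming the objects from the example above are
# already in scope (model, tokenizer, image_processor, device, conv_templates,
# process_images, tokenizer_image_token, DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX).
# Replace the placeholder paths with your own images.
images = [Image.open("first.jpg"), Image.open("second.jpg")]

# Preprocess every image and keep the original sizes for the anyres pipeline
image_tensors = process_images(images, image_processor, model.config)
image_tensors = [t.to(dtype=torch.float16, device=device) for t in image_tensors]
image_sizes = [img.size for img in images]

# One image placeholder token per image, in the same order as `images`
question = (DEFAULT_IMAGE_TOKEN + "\n") * len(images) + "What are the differences between these images?"
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

out = model.generate(
    input_ids,
    images=image_tensors,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=512,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```

Video inputs are handled by the same codebase using sampled frames; see the [LLaVA-VL/LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) repository for video inference examples.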

## Training

### Model

- **Architecture:** SO400M + Qwen2
- **Pretraining Stage:** LCS-558K, 1 epoch, projector
- **Mid Stage:** A mixture of 4.7M high-quality synthetic data, 1 epoch, full model
- **Final-Image Stage:** A mixture of 3.6M single-image data, 1 epoch, full model
- **OneVision Stage:** A mixture of 1.6M single-image/multi-image/video data, 1 epoch, full model
- **Precision:** bfloat16

### Hardware & Software

- **GPUs:** 256 × NVIDIA A100 (for training the whole model series)
- **Orchestration:** [Hugging Face Trainer](https://huggingface.co/docs/transformers/main_classes/trainer)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)

## Citation

```bibtex
@article{li2024llavaonevision,
  title={LLaVA-OneVision},
}
```