finalf0 committed on
Commit
d40be70
·
1 Parent(s): 200c189

update readme

Files changed (1)
  1. README.md +3 -1376
README.md CHANGED
@@ -24,1379 +24,6 @@ tags:
24
 
25
  <h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>
26
 
27
- [GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn)
28
-
29
-
30
- ## MiniCPM-o 2.6
31
-
32
- **MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
33
-
34
- - 🔥 **Leading Visual Capability.**
35
- MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
36
-
37
- - 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.
38
-
39
- - 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
40
-
41
- - 💪 **Strong OCR Capability and Others.**
42
- Advancing the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
43
- Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
44
-
45
-
46
- - 🚀 **Superior Efficiency.**
47
- In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.
48
-
49
- - 💫 **Easy Usage.**
50
- MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).
51
-
52
-
53
- **Model Architecture.**
54
-
55
- - **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
56
- - **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (see the illustrative sketch after the framework figure below).
57
- - **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates end-to-end voice cloning and description-based voice creation.
58
-
59
- <div align="center">
60
- <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" , width=80%>
61
- </div>
62
-
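- The sketch below is a minimal, illustrative rendering of the TDM idea (an assumption for clarity, not the actual implementation): parallel video and audio streams are cut into 1-second slices and interleaved into one sequential stream of units, mirroring the per-second units built by `get_video_chunk_content` in the usage examples below.
-
- ```python
- # Illustrative sketch only: interleave parallel modality streams into
- # sequential per-second time slices for the LLM backbone.
- def time_division_multiplex(frames, audio_chunks):
-     """frames: one video frame per second; audio_chunks: one 1-second audio chunk per second."""
-     units = []
-     for frame, audio in zip(frames, audio_chunks):
-         # each time slice becomes one sequential unit: marker, frame, audio
-         units.extend(["<unit>", frame, audio])
-     return units
- ```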
63
-
64
- ### Evaluation <!-- omit in toc -->
65
-
66
- <div align="center">
67
- <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/radar.jpg" width=90% />
68
- </div>
69
-
70
- <details>
71
- <summary>Click to view visual understanding results.</summary>
72
-
73
- **Image Understanding**
74
-
75
- <div align="center">
76
- <table style="margin: 0px auto;">
77
- <thead>
78
- <tr>
79
- <th align="left">Model</th>
80
- <th>Size</th>
81
- <th>Token Density<sup>+</sup></th>
82
- <th>OpenCompass</th>
83
- <th>OCRBench</th>
84
- <th>MathVista mini</th>
85
- <th>ChartQA</th>
86
- <th>MMVet</th>
87
- <th>MMStar</th>
88
- <th>MME</th>
89
- <th>MMB1.1 test</th>
90
- <th>AI2D</th>
91
- <th>MMMU val</th>
92
- <th>HallusionBench</th>
93
- <th>TextVQA val</th>
94
- <th>DocVQA test</th>
95
- <th>MathVerse mini</th>
96
- <th>MathVision</th>
97
- <th>MMHal Score</th>
98
- </tr>
99
- </thead>
100
- <tbody align="center">
101
- <tr>
102
- <td colspan="19" align="left"><strong>Proprietary</strong></td>
103
- </tr>
104
- <tr>
105
- <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
106
- <td>-</td>
107
- <td>1088</td>
108
- <td><u>69.9</u></td>
109
- <td>736</td>
110
- <td>61.3</td>
111
- <td>85.7</td>
112
- <td><strong>69.1</strong></td>
113
- <td>63.9</td>
114
- <td>2328.7</td>
115
- <td>82.2</td>
116
- <td>84.6</td>
117
- <td><strong>69.2</strong></td>
118
- <td><strong>55.0</strong></td>
119
- <td>-</td>
120
- <td>92.8</td>
121
- <td><strong>50.2</strong></td>
122
- <td><strong>30.4</strong></td>
123
- <td><u>3.6</u></td>
124
- </tr>
125
- <tr>
126
- <td nowrap="nowrap" align="left">Claude3.5-Sonnet</td>
127
- <td>-</td>
128
- <td>750</td>
129
- <td>67.9</td>
130
- <td>788</td>
131
- <td>61.6</td>
132
- <td><strong>90.8</strong></td>
133
- <td>66.0</td>
134
- <td>62.2</td>
135
- <td>1920.0</td>
136
- <td>78.5</td>
137
- <td>80.2</td>
138
- <td><u>65.9</u></td>
139
- <td>49.9</td>
140
- <td>-</td>
141
- <td><strong>95.2</strong></td>
142
- <td>-</td>
143
- <td>-</td>
144
- <td>3.4</td>
145
- </tr>
146
- <tr>
147
- <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
148
- <td>-</td>
149
- <td>-</td>
150
- <td>64.4</td>
151
- <td>754</td>
152
- <td>57.7</td>
153
- <td>81.3</td>
154
- <td>64.0</td>
155
- <td>59.1</td>
156
- <td>2110.6</td>
157
- <td>73.9</td>
158
- <td>79.1</td>
159
- <td>60.6</td>
160
- <td>45.6</td>
161
- <td>73.5</td>
162
- <td>86.5</td>
163
- <td>-</td>
164
- <td>19.2</td>
165
- <td>-</td>
166
- </tr>
167
- <tr>
168
- <td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td>
169
- <td>-</td>
170
- <td>1088</td>
171
- <td>64.1</td>
172
- <td>785</td>
173
- <td>52.4</td>
174
- <td>-</td>
175
- <td>66.9</td>
176
- <td>54.8</td>
177
- <td>2003.4</td>
178
- <td>76.0</td>
179
- <td>77.8</td>
180
- <td>60.0</td>
181
- <td>46.1</td>
182
- <td>-</td>
183
- <td>-</td>
184
- <td>-</td>
185
- <td>-</td>
186
- <td>3.3</td>
187
- </tr>
188
- <tr>
189
- <td colspan="19" align="left"><strong>Open Source</strong></td>
190
- </tr>
191
- <tr>
192
- <td nowrap="nowrap" align="left">Cambrian-34B</td>
193
- <td>34B</td>
194
- <td><u>1820</u></td>
195
- <td>58.3</td>
196
- <td>591</td>
197
- <td>50.3</td>
198
- <td>75.6</td>
199
- <td>53.2</td>
200
- <td>54.2</td>
201
- <td>2049.9</td>
202
- <td>77.8</td>
203
- <td>79.5</td>
204
- <td>50.4</td>
205
- <td>41.6</td>
206
- <td>76.7</td>
207
- <td>75.5</td>
208
- <td>-</td>
209
- <td>-</td>
210
- <td>-</td>
211
- </tr>
212
- <tr>
213
- <td nowrap="nowrap" align="left">GLM-4V-9B</td>
214
- <td>13B</td>
215
- <td>784</td>
216
- <td>59.1</td>
217
- <td>776</td>
218
- <td>51.1</td>
219
- <td>-</td>
220
- <td>58.0</td>
221
- <td>54.8</td>
222
- <td>2018.8</td>
223
- <td>67.9</td>
224
- <td>71.2</td>
225
- <td>46.9</td>
226
- <td>45.0</td>
227
- <td>-</td>
228
- <td>-</td>
229
- <td>-</td>
230
- <td>-</td>
231
- <td>-</td>
232
- </tr>
233
- <tr>
234
- <td nowrap="nowrap" align="left">Pixtral-12B</td>
235
- <td>12B</td>
236
- <td>256</td>
237
- <td>61.0</td>
238
- <td>685</td>
239
- <td>56.9</td>
240
- <td>81.8</td>
241
- <td>58.5</td>
242
- <td>54.5</td>
243
- <td>-</td>
244
- <td>72.7</td>
245
- <td>79.0</td>
246
- <td>51.1</td>
247
- <td>47.0</td>
248
- <td>75.7</td>
249
- <td>90.7</td>
250
- <td>-</td>
251
- <td>-</td>
252
- <td>-</td>
253
- </tr>
254
- <tr>
255
- <td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td>
256
- <td>27B</td>
257
- <td>672</td>
258
- <td>66.4</td>
259
- <td>809</td>
260
- <td>63.9</td>
261
- <td>86.0</td>
262
- <td>60.0</td>
263
- <td>61.9</td>
264
- <td>2253.0</td>
265
- <td>81.2</td>
266
- <td>83.8</td>
267
- <td>54.0</td>
268
- <td>45.3</td>
269
- <td><u>84.2</u></td>
270
- <td>93.3</td>
271
- <td>-</td>
272
- <td>-</td>
273
- <td>3.0</td>
274
- </tr>
275
- <tr>
276
- <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
277
- <td>8B</td>
278
- <td>784</td>
279
- <td>67.1</td>
280
- <td><u>866</u></td>
281
- <td>58.2</td>
282
- <td>83.0</td>
283
- <td>62.0</td>
284
- <td>60.7</td>
285
- <td>2326.0</td>
286
- <td>81.8</td>
287
- <td>83.0</td>
288
- <td>54.1</td>
289
- <td>50.6</td>
290
- <td><strong>84.3</strong></td>
291
- <td><u>94.5</u></td>
292
- <td>31.9</td>
293
- <td>16.3</td>
294
- <td>3.2</td>
295
- </tr>
296
- <tr>
297
- <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
298
- <td>72B</td>
299
- <td>182</td>
300
- <td>68.1</td>
301
- <td>741</td>
302
- <td>67.5</td>
303
- <td>83.7</td>
304
- <td>60.6</td>
305
- <td><strong>65.8</strong></td>
306
- <td>2261.0</td>
307
- <td><strong>85.0</strong></td>
308
- <td><u>85.6</u></td>
309
- <td>56.8</td>
310
- <td>49.0</td>
311
- <td>80.5</td>
312
- <td>91.3</td>
313
- <td>39.1</td>
314
- <td>-</td>
315
- <td>3.5</td>
316
- </tr>
317
- <tr>
318
- <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
319
- <td>8B</td>
320
- <td>706</td>
321
- <td>68.3</td>
322
- <td>822</td>
323
- <td><u>64.4</u></td>
324
- <td>84.8</td>
325
- <td>62.8</td>
326
- <td>62.8</td>
327
- <td>2344.0</td>
328
- <td><u>83.6</u></td>
329
- <td>84.5</td>
330
- <td>56.0</td>
331
- <td>50.1</td>
332
- <td>79.1</td>
333
- <td>93.0</td>
334
- <td>39.5</td>
335
- <td>19.7</td>
336
- <td>3.4</td>
337
- </tr>
338
- <tr>
339
- <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
340
- <td>8B</td>
341
- <td><strong>2822</strong></td>
342
- <td>65.2</td>
343
- <td>852*</td>
344
- <td>60.6</td>
345
- <td>79.4</td>
346
- <td>60.0</td>
347
- <td>57.5</td>
348
- <td><u>2348.4*</u></td>
349
- <td>78.0</td>
350
- <td>82.1</td>
351
- <td>49.8*</td>
352
- <td>48.1*</td>
353
- <td>80.1</td>
354
- <td>90.8</td>
355
- <td>25.7</td>
356
- <td>18.3</td>
357
- <td>3.6</td>
358
- </tr>
359
- <tr>
360
- <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
361
- <td>8B</td>
362
- <td><strong>2822</strong></td>
363
- <td><strong>70.2</strong></td>
364
- <td><strong>897*</strong></td>
365
- <td><strong>71.9*</strong></td>
366
- <td><u>86.9*</u></td>
367
- <td><u>67.5</u></td>
368
- <td><u>64.0</u></td>
369
- <td><strong>2372.0*</strong></td>
370
- <td>80.5</td>
371
- <td><strong>85.8</strong></td>
372
- <td>50.4*</td>
373
- <td><u>51.9</u></td>
374
- <td>82.0</td>
375
- <td>93.5</td>
376
- <td><u>41.4*</u></td>
377
- <td><u>23.1*</u></td>
378
- <td><strong>3.8</strong></td>
379
- </tr>
380
- </tbody>
381
- </table>
382
- </div>
383
- * We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
384
-
385
-
386
- <sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
387
-
388
- Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
389
-
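- As a quick sanity check (illustrative only, using the numbers quoted above), the MiniCPM-o 2.6 entry can be reproduced from its 1344x1344 maximum resolution and 640 visual tokens:
-
- ```python
- # Worked example of the token density formula above (illustrative).
- max_pixels = 1344 * 1344        # about 1.8 million pixels at maximum resolution
- visual_tokens = 640             # visual tokens produced for such an image
- token_density = max_pixels / visual_tokens
- print(round(token_density))     # 2822, matching the MiniCPM-o 2.6 row in the table
- ```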
390
-
391
- **Multi-image and Video Understanding**
392
-
393
- <div align="center">
394
-
395
- <table style="margin: 0px auto;">
396
- <thead>
397
- <tr>
398
- <th align="left">Model</th>
399
- <th>Size</th>
400
- <th>BLINK val</th>
401
- <th>Mantis Eval</th>
402
- <th>MIRB</th>
403
- <th>Video-MME (wo / w subs)</th>
404
- </tr>
405
- </thead>
406
- <tbody align="center">
407
- <tr>
408
- <td colspan="6" align="left"><strong>Proprietary</strong></td>
409
- </tr>
410
- <tr>
411
- <td nowrap="nowrap" align="left">GPT-4o-20240513</td>
412
- <td>-</td>
413
- <td><strong>68.0</strong></td>
414
- <td>-</td>
415
- <td>-</td>
416
- <td><strong>71.9/77.2</strong></td>
417
- </tr>
418
- <tr>
419
- <td nowrap="nowrap" align="left">GPT4V</td>
420
- <td>-</td>
421
- <td>54.6</td>
422
- <td>62.7</td>
423
- <td>53.1</td>
424
- <td>59.9/63.3</td>
425
- </tr>
426
- <tr>
427
- <td colspan="6" align="left"><strong>Open-source</strong></td>
428
- </tr>
429
- <tr>
430
- <td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td>
431
- <td>14B</td>
432
- <td>52.6</td>
433
- <td>66.4</td>
434
- <td>30.2</td>
435
- <td>-</td>
436
- </tr>
437
- <tr>
438
- <td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td>
439
- <td>72B</td>
440
- <td>55.4</td>
441
- <td><strong>77.6</strong></td>
442
- <td>-</td>
443
- <td><u>66.2/69.5</u></td>
444
- </tr>
445
- <tr>
446
- <td nowrap="nowrap" align="left">MANTIS 8B</td>
447
- <td>8B</td>
448
- <td>49.1</td>
449
- <td>59.5</td>
450
- <td>34.8</td>
451
- <td>-</td>
452
- </tr>
453
- <tr>
454
- <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
455
- <td>8B</td>
456
- <td>53.2</td>
457
- <td>69.6*</td>
458
- <td><strong>67.6*</strong></td>
459
- <td>63.3/69.0</td>
460
- </tr>
461
- <tr>
462
- <td nowrap="nowrap" align="left">InternVL2.5-8B</td>
463
- <td>8B</td>
464
- <td>54.8</td>
465
- <td>67.7</td>
466
- <td>52.5</td>
467
- <td>64.2/66.9</td>
468
- </tr>
469
- <tr>
470
- <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
471
- <td>8B</td>
472
- <td>53.0</td>
473
- <td>69.1</td>
474
- <td>53.8</td>
475
- <td>60.9/63.6</td>
476
- </tr>
477
- <tr>
478
- <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
479
- <td>8B</td>
480
- <td><u>56.7</u></td>
481
- <td><u>71.9</u></td>
482
- <td><u>58.6</u></td>
483
- <td>63.9/67.9</td>
484
- </tr>
485
- </tbody>
486
- </table>
487
-
488
- </div>
489
- * We evaluate officially released checkpoints by ourselves.
490
-
491
- </details>
492
-
493
-
494
- <details>
495
- <summary>Click to view audio understanding and speech conversation results.</summary>
496
-
497
- **Audio Understanding**
498
-
499
- <div align="center">
500
- <table style="margin: 0px auto;">
501
- <thead>
502
- <tr>
503
- <th align="left">Task</th>
504
- <th>Size</th>
505
- <th colspan="3">ASR (zh)</th>
506
- <th colspan="3">ASR (en)</th>
507
- <th colspan="2">AST</th>
508
- <th>Emotion</th>
509
- </tr>
510
- <tr>
511
- <th align="left">Metric</th>
512
- <td></td>
513
- <th colspan="3">CER↓</th>
514
- <th colspan="3">WER↓</th>
515
- <th colspan="2">BLEU↑</th>
516
- <th>ACC↑</th>
517
- </tr>
518
- <tr>
519
- <th align="left">Dataset</th>
520
- <td></td>
521
- <th>AISHELL-1</th>
522
- <th>Fleurs zh</th>
523
- <th>WenetSpeech test-net</th>
524
- <th>LibriSpeech test-clean</th>
525
- <th>GigaSpeech</th>
526
- <th>TED-LIUM</th>
527
- <th>CoVoST en2zh</th>
528
- <th>CoVoST zh2en</th>
529
- <th>MELD emotion</th>
530
- </tr>
531
- </thead>
532
- <tbody align="center">
533
- <tr>
534
- <td colspan="11" align="left"><strong>Proprietary</strong></td>
535
- </tr>
536
- <tr>
537
- <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
538
- <td>-</td>
539
- <td>7.3*</td>
540
- <td><u>5.4*</u></td>
541
- <td>28.9*</td>
542
- <td>2.6*</td>
543
- <td>12.9*</td>
544
- <td>4.8*</td>
545
- <td>37.1*</td>
546
- <td>15.7*</td>
547
- <td>33.2*</td>
548
- </tr>
549
- <tr>
550
- <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
551
- <td>-</td>
552
- <td>4.5*</td>
553
- <td>5.9*</td>
554
- <td>14.3*</td>
555
- <td>2.9*</td>
556
- <td>10.6*</td>
557
- <td><strong>3.0*</strong></td>
558
- <td><u>47.3*</u></td>
559
- <td>22.6*</td>
560
- <td>48.4*</td>
561
- </tr>
562
- <tr>
563
- <td colspan="11" align="left"><strong>Open-Source</strong></td>
564
- </tr>
565
- <tr>
566
- <td nowrap="nowrap" align="left">Qwen2-Audio-Base</td>
567
- <td>8B</td>
568
- <td>-</td>
569
- <td>7.5</td>
570
- <td>-</td>
571
- <td><strong>1.6</strong></td>
572
- <td>-</td>
573
- <td>-</td>
574
- <td>45.2</td>
575
- <td><u>24.4</u></td>
576
- <td><strong>55.3</strong></td>
577
- </tr>
578
- <tr>
579
- <td nowrap="nowrap" align="left">Qwen2-Audio-Instruction</td>
580
- <td>8B</td>
581
- <td>2.6*</td>
582
- <td>6.9*</td>
583
- <td><u>10.3*</u></td>
584
- <td>3.1*</td>
585
- <td><u>9.7</u>*</td>
586
- <td>5.9*</td>
587
- <td>39.5*</td>
588
- <td>22.9*</td>
589
- <td>17.4*</td>
590
- </tr>
591
- <tr>
592
- <td nowrap="nowrap" align="left">GLM-4-Voice-Base</td>
593
- <td>9B</td>
594
- <td><u>2.5</u></td>
595
- <td>-</td>
596
- <td>-</td>
597
- <td>2.8</td>
598
- <td>-</td>
599
- <td>-</td>
600
- <td>-</td>
601
- <td>-</td>
602
- </tr>
603
- <tr>
604
- <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
605
- <td>8B</td>
606
- <td><strong>1.6</strong></td>
607
- <td><strong>4.4</strong></td>
608
- <td><strong>6.9</strong></td>
609
- <td><u>1.7</u></td>
610
- <td><strong>8.7</strong></td>
611
- <td><strong>3.0</strong></td>
612
- <td><strong>48.2</strong></td>
613
- <td><strong>27.2</strong></td>
614
- <td><u>52.4</u></td>
615
- </tr>
616
- </tbody>
617
- </table>
618
- </div>
619
- * We evaluate officially released checkpoints by ourselves.<br><br>
620
-
621
- **Speech Generation**
622
-
623
- <div align="center">
624
- <table style="margin: 0px auto;">
625
- <thead>
626
- <tr>
627
- <th align="left">Task</th>
628
- <th>Size</th>
629
- <th colspan="9">SpeechQA</th>
630
- </tr>
631
- <tr>
632
- <th align="left">Metric</th>
633
- <th></th>
634
- <th colspan="3">ACC↑</th>
635
- <th>G-Eval (10 point)↑</th>
636
- <th>Semantic ELO score↑</th>
637
- <th>Acoustic ELO score↑</th>
638
- <th>Overall ELO score↑</th>
639
- <th>UTMOS↑</th>
640
- <th>ASR-WER↓</th>
641
- </tr>
642
- <tr>
643
- <th align="left">Dataset</th>
644
- <th></th>
645
- <th>Speech Llama Q.</th>
646
- <th>Speech Web Q.</th>
647
- <th>Speech Trivia QA</th>
648
- <th>Speech AlpacaEval</th>
649
- <th colspan="5">AudioArena</th>
650
- </tr>
651
- </thead>
652
- <tbody align="center">
653
- <tr>
654
- <td colspan="11" align="left"><strong>Proprietary</strong></td>
655
- </tr>
656
- <tr>
657
- <td nowrap="nowrap" align="left">GPT-4o-Realtime</td>
658
- <td></td>
659
- <td><strong>71.7</strong></td>
660
- <td><strong>51.6</strong></td>
661
- <td><strong>69.7</strong></td>
662
- <td><strong>7.4</strong></td>
663
- <td><strong>1157</strong></td>
664
- <td><strong>1203</strong></td>
665
- <td><strong>1200</strong></td>
666
- <td><strong>4.2</strong></td>
667
- <td><strong>2.3</strong></td>
668
- </tr>
669
- <tr>
670
- <td colspan="11" align="left"><strong>Open-Source</strong></td>
671
- </tr>
672
- <tr>
673
- <td nowrap="nowrap" align="left">GLM-4-Voice</td>
674
- <td>9B</td>
675
- <td>50.0</td>
676
- <td>32.0</td>
677
- <td>36.4</td>
678
- <td><u>5.1</u></td>
679
- <td>999</td>
680
- <td>1147</td>
681
- <td>1035</td>
682
- <td><u>4.1</u></td>
683
- <td><u>11.7</u></td>
684
- </tr>
685
- <tr>
686
- <td nowrap="nowrap" align="left">Llama-Omni</td>
687
- <td>8B</td>
688
- <td>45.3</td>
689
- <td>22.9</td>
690
- <td>10.7</td>
691
- <td>3.9</td>
692
- <td>960</td>
693
- <td>878</td>
694
- <td>897</td>
695
- <td>3.2</td>
696
- <td>24.3</td>
697
- </tr>
698
- <tr>
699
- <td nowrap="nowrap" align="left">Moshi</td>
700
- <td>7B</td>
701
- <td>43.7</td>
702
- <td>23.8</td>
703
- <td>16.7</td>
704
- <td>2.4</td>
705
- <td>871</td>
706
- <td>808</td>
707
- <td>875</td>
708
- <td>2.8</td>
709
- <td>8.2</td>
710
- </tr>
711
- <tr>
712
- <td nowrap="nowrap" align="left">Mini-Omni</td>
713
- <td>1B</td>
714
- <td>22.0</td>
715
- <td>12.8</td>
716
- <td>6.9</td>
717
- <td>2.5</td>
718
- <td>926</td>
719
- <td>803</td>
720
- <td>865</td>
721
- <td>3.4</td>
722
- <td>10.0</td>
723
- </tr>
724
- <tr>
725
- <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
726
- <td>8B</td>
727
- <td><u>61.0</u></td>
728
- <td><u>40.0</u></td>
729
- <td><u>40.2</u></td>
730
- <td><u>5.1</u></td>
731
- <td><u>1088</u></td>
732
- <td><u>1163</u></td>
733
- <td><u>1131</u></td>
734
- <td><strong>4.2</strong></td>
735
- <td>9.8</td>
736
- </tr>
737
- </tbody>
738
- </table>
739
- </div>
740
- All results are from AudioEvals, and the evaluation methods along with further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
741
-
742
- **End-to-end Voice Cloning**
743
-
744
- <div align="center">
745
- <table style="margin: 0px auto;">
746
- <thead>
747
- <tr>
748
- <th align="left">Task</th>
749
- <th colspan="2">Voice cloning</th>
750
- </tr>
751
- <tr>
752
- <th align="left">Metric</th>
753
- <th>SIMO↑</th>
754
- <th>SIMO↑</th>
755
- </tr>
756
- <tr>
757
- <th align="left">Dataset</th>
758
- <th>Seed-TTS test-zh</th>
759
- <th>Seed-TTS test-en</th>
760
- </tr>
761
- </thead>
762
- <tbody align="center">
763
- <tr>
764
- <td nowrap="nowrap" align="left">F5-TTS</td>
765
- <td><strong>76</strong></td>
766
- <td><strong>67</strong></td>
767
- </tr>
768
- <tr>
769
- <td nowrap="nowrap" align="left">CosyVoice</td>
770
- <td><u>75</u></td>
771
- <td><u>64</u></td>
772
- </tr>
773
- <tr>
774
- <td nowrap="nowrap" align="left">FireRedTTS</td>
775
- <td>63</td>
776
- <td>46</td>
777
- </tr>
778
- <tr>
779
- <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
780
- <td>57</td>
781
- <td>47</td>
782
- </tr>
783
- </tbody>
784
- </table>
785
- </div>
786
-
787
- </details>
788
-
789
- <details>
790
- <summary>Click to view multimodal live streaming results.</summary>
791
-
792
- **Multimodal Live Streaming**: results on StreamingBench
793
-
794
- <table style="margin: 0px auto;">
795
- <thead>
796
- <tr>
797
- <th align="left">Model</th>
798
- <th>Size</th>
799
- <th>Real-Time Video Understanding</th>
800
- <th>Omni-Source Understanding</th>
801
- <th>Contextual Understanding</th>
802
- <th>Overall</th>
803
- </tr>
804
- </thead>
805
- <tbody align="center">
806
- <tr>
807
- <td colspan="7" align="left"><strong>Proprietary</strong></td>
808
- </tr>
809
- <tr>
810
- <td nowrap="nowrap" align="left">Gemini 1.5 Pro</td>
811
- <td>-</td>
812
- <td><u>77.4</u></td>
813
- <td><strong>67.8</strong></td>
814
- <td><strong>51.1</strong></td>
815
- <td><strong>70.3</strong></td>
816
- </tr>
817
- <tr>
818
- <td nowrap="nowrap" align="left">GPT-4o-202408</td>
819
- <td>-</td>
820
- <td>74.5</td>
821
- <td>51.0</td>
822
- <td><u>48.0</u></td>
823
- <td>64.1</td>
824
- </tr>
825
- <tr>
826
- <td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td>
827
- <td>-</td>
828
- <td>74.0</td>
829
- <td>41.4</td>
830
- <td>37.8</td>
831
- <td>59.7</td>
832
- </tr>
833
- <tr>
834
- <td colspan="9" align="left"><strong>Open-source</strong></td>
835
- </tr>
836
- <tr>
837
- <td nowrap="nowrap" align="left">VILA-1.5</td>
838
- <td>8B</td>
839
- <td>61.5</td>
840
- <td>37.5</td>
841
- <td>26.7</td>
842
- <td>49.5</td>
843
- </tr>
844
- <tr>
845
- <td nowrap="nowrap" align="left">LongVA</td>
846
- <td>7B</td>
847
- <td>63.1</td>
848
- <td>35.9</td>
849
- <td>30.2</td>
850
- <td>50.7</td>
851
- </tr>
852
- <tr>
853
- <td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td>
854
- <td>34B</td>
855
- <td>69.8</td>
856
- <td>41.7</td>
857
- <td>34.3</td>
858
- <td>56.7</td>
859
- </tr>
860
- <tr>
861
- <td nowrap="nowrap" align="left">Qwen2-VL-7B</td>
862
- <td>8B</td>
863
- <td>71.2</td>
864
- <td>40.7</td>
865
- <td>33.1</td>
866
- <td>57.0</td>
867
- </tr>
868
- <tr>
869
- <td nowrap="nowrap" align="left">InternVL2-8B</td>
870
- <td>8B</td>
871
- <td>70.1</td>
872
- <td>42.7</td>
873
- <td>34.1</td>
874
- <td>57.0</td>
875
- </tr>
876
- <tr>
877
- <td nowrap="nowrap" align="left">VITA-1.5</td>
878
- <td>8B</td>
879
- <td>70.9</td>
880
- <td>40.8</td>
881
- <td>35.8</td>
882
- <td>57.4</td>
883
- </tr>
884
- <tr>
885
- <td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td>
886
- <td>8B</td>
887
- <td>74.3</td>
888
- <td>40.8</td>
889
- <td>31.0</td>
890
- <td>58.4</td>
891
- </tr>
892
- <tr>
893
- <td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td>
894
- <td>8B</td>
895
- <td>75.4</td>
896
- <td>46.2</td>
897
- <td>33.6</td>
898
- <td>60.8</td>
899
- </tr>
900
- <tr>
901
- <td nowrap="nowrap" align="left">MiniCPM-V 2.6</td>
902
- <td>8B</td>
903
- <td>72.4</td>
904
- <td>40.2</td>
905
- <td>33.4</td>
906
- <td>57.7</td>
907
- </tr>
908
- <tr>
909
- <td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
910
- <td>8B</td>
911
- <td><strong>79.9</strong></td>
912
- <td><u>53.4</u></td>
913
- <td>38.5</td>
914
- <td><u>66.0</u></td>
915
- </tr>
916
- </tbody>
917
- </table>
918
-
919
- </details>
920
-
921
-
922
- ### Examples <!-- omit in toc -->
923
-
924
- We deploy MiniCPM-o 2.6 on end-side devices. The demo video is a raw-speed recording on an iPad Pro and a web demo.
925
-
926
- <div align="center">
927
- <a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png", width=70%></a>
928
- </div>
929
-
930
- <br>
931
-
932
-
933
- <div style="display: flex; flex-direction: column; align-items: center;">
934
- <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
935
- <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
936
- <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
937
- </div>
938
-
939
-
940
-
941
-
942
- ## Online Demo
943
- Click here to try the online demo of [MiniCPM-o 2.6](https://minicpm-omni-webdemo-us.modelbest.cn).
944
-
945
-
946
- ## Usage
947
- Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.10:
948
- ```
949
- Pillow==10.1.0
950
- torch==2.2.0
951
- torchaudio==2.2.0
952
- torchvision==0.17.0
953
- transformers==4.44.2
954
- librosa==0.9.0
955
- soundfile==0.12.1
956
- vector-quantize-pytorch==1.18.5
957
- vocos==0.1.0
958
- decord
959
- moviepy
960
- ```
961
-
962
-
963
- ### Model initialization
964
- ```python
965
-
966
- import torch
967
- from PIL import Image
968
- from transformers import AutoModel, AutoTokenizer
969
-
970
- # Load the omni model by default; init_vision/init_audio/init_tts all default to True.
971
- # To load a vision-only model, set init_audio=False and init_tts=False.
972
- # To load an audio-only model, set init_vision=False.
973
- model = AutoModel.from_pretrained(
974
- 'openbmb/MiniCPM-o-2_6',
975
- trust_remote_code=True,
976
- attn_implementation='sdpa', # sdpa or flash_attention_2
977
- torch_dtype=torch.bfloat16,
978
- init_vision=True,
979
- init_audio=True,
980
- init_tts=True
981
- )
982
-
983
-
984
- model = model.eval().cuda()
985
- tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)
986
-
987
- # For modes other than vision-only, the TTS processor and vocos vocoder also need to be initialized.
988
- model.init_tts()
989
- model.tts.float()
990
- ```
991
- ### Omni mode
992
- We provide two inference modes: chat and streaming.
993
-
994
- #### Chat inference
995
- ```python
996
- import math
997
- import numpy as np
998
- from PIL import Image
999
- from moviepy.editor import VideoFileClip
1000
- import tempfile
1001
- import librosa
1002
- import soundfile as sf
1003
-
1004
- def get_video_chunk_content(video_path, flatten=True):
1005
- video = VideoFileClip(video_path)
1006
- print('video_duration:', video.duration)
1007
-
1008
- with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
1009
- temp_audio_file_path = temp_audio_file.name
1010
- video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
1011
- audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
1012
- num_units = math.ceil(video.duration)
1013
-
1014
- # 1 frame + 1s audio chunk
1015
- contents= []
1016
- for i in range(num_units):
1017
- frame = video.get_frame(i+1)
1018
- image = Image.fromarray((frame).astype(np.uint8))
1019
- audio = audio_np[sr*i:sr*(i+1)]
1020
- if flatten:
1021
- contents.extend(["<unit>", image, audio])
1022
- else:
1023
- contents.append(["<unit>", image, audio])
1024
-
1025
- return contents
1026
-
1027
- video_path="/path/to/video"
1028
- # To use a voice-cloning prompt, set ref_audio.
1029
- ref_audio_path = 'assets/demo.wav'
1030
- ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
1031
- sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
1032
- # or use default prompt
1033
- # sys_msg = model.get_sys_prompt(mode='omni', language='en')
1034
-
1035
- contents = get_video_chunk_content(video_path)
1036
- msg = {"role":"user", "content": contents}
1037
- msgs = [sys_msg, msg]
1038
-
1039
- # Set generate_audio=True and output_audio_path to save the TTS result.
1040
- generate_audio = True
1041
- output_audio_path = 'output.wav'
1042
-
1043
- res = model.chat(
1044
- msgs=msgs,
1045
- tokenizer=tokenizer,
1046
- sampling=True,
1047
- temperature=0.5,
1048
- max_new_tokens=4096,
1049
- omni_input=True, # set omni_input=True for omni (audio + video) inference
1050
- use_tts_template=True,
1051
- generate_audio=generate_audio,
1052
- output_audio_path=output_audio_path,
1053
- max_slice_nums=1,
1054
- use_image_id=False,
1055
- return_dict=True
1056
- )
1057
- print(res)
1058
- ```
1059
- #### Streaming inference
1060
- ```python
1061
- # A new conversation needs to reset the session first; this clears the KV cache.
1062
- model.reset_session()
1063
-
1064
- contents = get_video_chunk_content(video_path, flatten=False)
1065
- session_id = '123'
1066
- generate_audio = True
1067
-
1068
- # 1. prefill system prompt
1069
- res = model.streaming_prefill(
1070
- session_id=session_id,
1071
- msgs=[sys_msg],
1072
- tokenizer=tokenizer
1073
- )
1074
-
1075
- # 2. prefill video/audio chunks
1076
- for content in contents:
1077
- msgs = [{"role":"user", "content": content}]
1078
- res = model.streaming_prefill(
1079
- session_id=session_id,
1080
- msgs=msgs,
1081
- tokenizer=tokenizer
1082
- )
1083
-
1084
- # 3. generate
1085
- res = model.streaming_generate(
1086
- session_id=session_id,
1087
- tokenizer=tokenizer,
1088
- temperature=0.5,
1089
- generate_audio=generate_audio
1090
- )
1091
-
1092
- audios = []
1093
- text = ""
1094
-
1095
- if generate_audio:
1096
- for r in res:
1097
- audio_wav = r.audio_wav
1098
- sampling_rate = r.sampling_rate
1099
- txt = r.text
1100
-
1101
- audios.append(audio_wav)
1102
- text += txt
1103
-
1104
- res = np.concatenate(audios)
1105
- sf.write("output.wav", res, samplerate=sampling_rate)
1106
- print("text:", text)
1107
- print("audio saved to output.wav")
1108
- else:
1109
- for r in res:
1110
- text += r['text']
1111
- print("text:", text)
1112
-
1113
- ```
1114
-
1115
- ### Audio-Only mode
1116
- #### Mimick
1117
- The `Mimick` task reflects a model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed audio and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.
1118
- ```python
1119
- mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
1120
- audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
1121
- msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}]
1122
-
1123
- res = model.chat(
1124
- msgs=msgs,
1125
- tokenizer=tokenizer,
1126
- sampling=True,
1127
- max_new_tokens=128,
1128
- use_tts_template=True,
1129
- temperature=0.3,
1130
- generate_audio=True,
1131
- output_audio_path='output.wav', # save the tts result to output_audio_path
1132
- )
1133
- ```
1134
-
1135
- #### General Speech Conversation with Configurable Voices
1136
- <details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary>
1137
-
1138
- ```python
1139
- ref_audio, _ = librosa.load('assets/demo.wav', sr=16000, mono=True) # load the reference audio
1140
-
1141
- # Choose the mode you want to use
1142
- # Audio RolePlay: in this mode, the model role-plays the character based on the audio prompt. (More human-like conversation, but less stable.)
1143
- # sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')
1144
- # user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
1145
-
1146
- # Audio Assistant: in this mode, the model speaks with the voice in ref_audio as an AI assistant. (Stable and more suitable for general conversation.)
1147
- sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en')
1148
- user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # Try to ask something by recording it in 'xxx.wav'!!!
1149
- ```
1150
- ```python
1151
- msgs = [sys_prompt, user_question]
1152
- # round one
1153
- res = model.chat(
1154
- msgs=msgs,
1155
- tokenizer=tokenizer,
1156
- sampling=True,
1157
- max_new_tokens=128,
1158
- use_tts_template=True,
1159
- generate_audio=True,
1160
- temperature=0.3,
1161
- output_audio_path='result.wav',
1162
- )
1163
-
1164
- # round two
1165
- msgs.append({'role': 'assistant', 'content': res})  # list.append returns None, so keep extending msgs in place
1166
- user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
1167
- msgs.append(user_question)
1168
- res = model.chat(
1169
- msgs=msgs,
1170
- tokenizer=tokenizer,
1171
- sampling=True,
1172
- max_new_tokens=128,
1173
- use_tts_template=True,
1174
- generate_audio=True,
1175
- temperature=0.3,
1176
- output_audio_path='result_round_2.wav',
1177
- )
1178
- print(res)
1179
- ```
1180
-
1181
- </details>
1182
-
1183
- #### Addressing various audio tasks
1184
- <details>
1185
- <summary> Click to show Python code running MiniCPM-o 2.6 on a specific audio QA task. </summary>
1186
-
1187
- ```python
1188
- '''
1189
- Audio Understanding Task Prompt:
1190
- Speech:
1191
- ASR with ZH(same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。
1192
- ASR with EN(same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
1193
- Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
1194
- General Audio:
1195
- Audio Caption: Summarize the main content of the audio.
1196
- Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene.
1197
- '''
1198
- task_prompt = "" # Choose the task prompt above
1199
- audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True)
1200
-
1201
- msgs = [{'role': 'user', 'content': [task_prompt,audio_input]}]
1202
-
1203
- res = model.chat(
1204
- msgs=msgs,
1205
- tokenizer=tokenizer,
1206
- sampling=True,
1207
- max_new_tokens=128,
1208
- use_tts_template=True,
1209
- generate_audio=True,
1210
- temperature=0.3,
1211
- output_audio_path='result.wav',
1212
- )
1213
- print(res)
1214
- ```
1215
- ```python
1216
- '''
1217
- Speech Generation Task Prompt:
1218
- Human Instruction-to-Speech: see https://voxinstruct.github.io/VoxInstruct/
1219
- Example:
1220
- # 在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。
1221
- # Delighting in a surprised tone, an adult male with low pitch and low volume comments:"One even gave my little dog a biscuit" This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context.
1222
-
1223
- Voice Cloning or Voice Conversion: with this mode, the model acts like a TTS model.
1224
- '''
1225
- # Human Instruction-to-Speech:
1226
- task_prompt = '' # Write a Human Instruction-to-Speech prompt here (voice creation)
1227
- msgs = [{'role': 'user', 'content': [task_prompt]}] # you can also try to ask the same audio question
1228
-
1229
- # Voice Cloning mode:
1230
- # sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
1231
- # text_prompt = f"Please read the text below."
1232
- # user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # using same voice in sys_prompt to read the text. (Voice Cloning)
1233
- # user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # using same voice in sys_prompt to read 'xxx.wav'. (Voice Conversion)
1234
- # msgs = [sys_prompt, user_question]
1235
-
1236
- res = model.chat(
1237
- msgs=msgs,
1238
- tokenizer=tokenizer,
1239
- sampling=True,
1240
- max_new_tokens=128,
1241
- use_tts_template=True,
1242
- generate_audio=True,
1243
- temperature=0.3,
1244
- output_audio_path='result.wav',
1245
- )
1246
-
1247
-
1248
- ```
1249
-
1250
- </details>
1251
-
1252
- ### Vision-Only mode
1253
-
1254
- `MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.
1255
-
1256
- #### Chat with single image
1257
- ```python
1258
- # test.py
1259
- image = Image.open('xx.jpg').convert('RGB')
1260
- question = 'What is in the image?'
1261
- msgs = [{'role': 'user', 'content': [image, question]}]
1262
- res = model.chat(
1263
- image=None,
1264
- msgs=msgs,
1265
- tokenizer=tokenizer
1266
- )
1267
- print(res)
1268
-
1269
- ## if you want to use streaming, please make sure sampling=True and stream=True
1270
- ## the model.chat will return a generator
1271
- res = model.chat(
1272
- msgs=msgs,
1273
- tokenizer=tokenizer,
1274
- sampling=True,
1275
- stream=True
1276
- )
1277
- generated_text = ""
1278
- for new_text in res:
1279
- generated_text += new_text
1280
- print(new_text, flush=True, end='')
1281
- ```
1282
-
1283
- #### Chat with multiple images
1284
- <details>
1285
- <summary> Click to show Python code running MiniCPM-o 2.6 with multiple images as input. </summary>
1286
-
1287
- ```python
1288
- image1 = Image.open('image1.jpg').convert('RGB')
1289
- image2 = Image.open('image2.jpg').convert('RGB')
1290
- question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
1291
- msgs = [{'role': 'user', 'content': [image1, image2, question]}]
1292
- answer = model.chat(
1293
- msgs=msgs,
1294
- tokenizer=tokenizer
1295
- )
1296
- print(answer)
1297
- ```
1298
- </details>
1299
-
1300
- #### In-context few-shot learning
1301
- <details>
1302
- <summary> Click to view Python code running MiniCPM-o 2.6 with few-shot input. </summary>
1303
-
1304
- ```python
1305
- question = "production date"
1306
- image1 = Image.open('example1.jpg').convert('RGB')
1307
- answer1 = "2023.08.04"
1308
- image2 = Image.open('example2.jpg').convert('RGB')
1309
- answer2 = "2007.04.24"
1310
- image_test = Image.open('test.jpg').convert('RGB')
1311
- msgs = [
1312
- {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
1313
- {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
1314
- {'role': 'user', 'content': [image_test, question]}
1315
- ]
1316
- answer = model.chat(
1317
- msgs=msgs,
1318
- tokenizer=tokenizer
1319
- )
1320
- print(answer)
1321
- ```
1322
- </details>
1323
-
1324
- #### Chat with video
1325
- <details>
1326
- <summary> Click to view Python code running MiniCPM-o 2.6 with video input. </summary>
1327
-
1328
- ```python
1329
- from decord import VideoReader, cpu  # decord (listed in the requirements above) provides VideoReader and cpu
-
- MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number
1330
- def encode_video(video_path):
1331
- def uniform_sample(l, n):
1332
- gap = len(l) / n
1333
- idxs = [int(i * gap + gap / 2) for i in range(n)]
1334
- return [l[i] for i in idxs]
1335
- vr = VideoReader(video_path, ctx=cpu(0))
1336
- sample_fps = round(vr.get_avg_fps() / 1) # FPS
1337
- frame_idx = [i for i in range(0, len(vr), sample_fps)]
1338
- if len(frame_idx) > MAX_NUM_FRAMES:
1339
- frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
1340
- frames = vr.get_batch(frame_idx).asnumpy()
1341
- frames = [Image.fromarray(v.astype('uint8')) for v in frames]
1342
- print('num frames:', len(frames))
1343
- return frames
1344
- video_path ="video_test.mp4"
1345
- frames = encode_video(video_path)
1346
- question = "Describe the video"
1347
- msgs = [
1348
- {'role': 'user', 'content': frames + [question]},
1349
- ]
1350
- # Set decode params for video
1351
- params={}
1352
- params["use_image_id"] = False
1353
- params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448
1354
- answer = model.chat(
1355
- msgs=msgs,
1356
- tokenizer=tokenizer,
1357
- **params
1358
- )
1359
- print(answer)
1360
- ```
1361
- </details>
1362
-
1363
- Please see [GitHub](https://github.com/OpenBMB/MiniCPM-o) for more details about usage.
1364
-
1365
-
1366
- ## Inference with llama.cpp<a id="llamacpp"></a>
1367
- MiniCPM-o 2.6 (vision-only mode) can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-omni) and its [readme](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) for more details.
1368
-
1369
-
1370
- ## Int4 quantized version
1371
- Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4).
1372
-
1373
-
1374
- ## License
1375
- #### Model License
1376
- * The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
1377
- * The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
1378
- * The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use.
1379
-
1380
-
1381
- #### Statement
1382
- * As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
1383
- * We will not be liable for any problems arising from the use of the MiniCPM-o and MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
1384
-
1385
- ## Key Techniques and Other Multimodal Projects
1386
-
1387
- 👏 Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team:
1388
-
1389
- [VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
1390
-
1391
- ## Citation
1392
-
1393
- If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
1394
-
1395
- ```bib
1396
- @article{yao2024minicpm,
1397
- title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
1398
- author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
1399
- journal={arXiv preprint arXiv:2408.01800},
1400
- year={2024}
1401
- }
1402
- ```
 
24
 
25
  <h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>
26
 
27
+ ## MiniCPM-o 2.6 int4
28
+ This is the int4 quantized version of [**MiniCPM-o 2.6**](https://modelscope.cn/models/OpenBMB/MiniCPM-o-2_6).
29
+ Running the int4 version uses lower GPU memory (about 9 GB).
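+
+ A minimal loading sketch is shown below. It assumes the int4 checkpoint exposes the same `transformers` interface as the full-precision MiniCPM-o 2.6 and uses the repository id `openbmb/MiniCPM-o-2_6-int4`; the exact arguments may differ.
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+
+ # The weights in this repo are already quantized to int4, so no extra
+ # quantization arguments are passed here (illustrative sketch only).
+ model = AutoModel.from_pretrained(
+     'openbmb/MiniCPM-o-2_6-int4',
+     trust_remote_code=True,
+ )
+ model = model.eval()
+ tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
+ ```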