|
--- |
|
pipeline_tag: image-text-to-text |
|
datasets: |
|
- openbmb/RLAIF-V-Dataset |
|
library_name: transformers |
|
language: |
|
- multilingual |
|
tags: |
|
- minicpm-o |
|
- omni |
|
- vision |
|
- ocr |
|
- multi-image |
|
- video |
|
- custom_code |
|
--- |
|
|
|
<h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1> |
|
|
|
[GitHub](https://github.com/OpenBMB/MiniCPM-V) | [Online Demo](https://minicpm-omni-webdemo-us.modelbest.cn)
|
|
|
|
|
## MiniCPM-o 2.6 |
|
|
|
**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for realtime speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include: |
|
|
|
- 🔥 **Leading Visual Capability.** |
|
MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.
|
|
|
- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual realtime speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, voice cloning, role play, etc. |
|
|
|
- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support realtime speech interaction**. It **outperforms GPT-4o-realtime and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
|
|
|
- 💪 **Strong OCR Capability and Others.** |
|
Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
|
Based on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and [VisCPM](https://github.com/OpenBMB/VisCPM) techniques, it features **trustworthy behaviors**, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.
|
|
|
|
|
- 🚀 **Superior Efficiency.** |
|
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad. |
|
|
|
- 💫 **Easy Usage.** |
|
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](#llamacpp) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demos on the [CN](https://minicpm-omni-webdemo.modelbest.cn/) and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) servers.
|
|
|
|
|
**Model Architecture.** |
|
|
|
- **End-to-end Omni-modal Architecture.** Different modality encoder/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge. |
|
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoder/decoders into online ones for streaming inputs/outputs. (2) We devise a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (see the brief sketch after this list).
|
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and a new audio system prompt that determines the assistant's voice. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.
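
As an illustration of the time-slice idea (not the model's internal implementation), parallel streams can be packed into sequential one-second units, each holding one video frame plus one second of audio, mirroring the omni-mode chunking helper shown later in this card. A minimal sketch under that assumption (hypothetical function name):

```python
from typing import List

import numpy as np

def interleave_streams(frames: List[np.ndarray], audio: np.ndarray, sr: int = 16000) -> list:
    # Illustrative only: pack parallel video/audio streams into sequential
    # 1-second units ("<unit>" + frame + audio chunk), the same layout used by
    # get_video_chunk_content in the omni-mode example below.
    units = []
    for i, frame in enumerate(frames):
        chunk = audio[sr * i : sr * (i + 1)]    # the i-th second of audio
        units.extend(["<unit>", frame, chunk])  # sequential slice for the LLM backbone
    return units
```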
|
|
|
<div align="center"> |
|
<img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpm-o-26-framework.png" , width=80%> |
|
</div> |
|
|
|
### Evaluation <!-- omit in toc --> |
|
|
|
<div align="center"> |
|
<img src="https://github.com/OpenBMB/MiniCPM-V/raw/main/assets/radar.png" width=66% /> |
|
</div> |
|
|
|
<details> |
|
<summary>Click to view visual understanding results.</summary> |
|
|
|
**Image Understanding** |
|
|
|
<div align="center"> |
|
<table style="margin: 0px auto;"> |
|
<thead> |
|
<tr> |
|
<th align="left">Model</th> |
|
<th>Size</th> |
|
<th>Token Density<sup>+</sup></th> |
|
<th>OpenCompass</th> |
|
<th>OCRBench</th> |
|
<th>MathVista mini</th> |
|
<th>ChartQA</th> |
|
<th>MMVet</th> |
|
<th>MMStar</th> |
|
<th>MME</th> |
|
<th>MMB1.1 test</th> |
|
<th>AI2D</th> |
|
<th>MMMU val</th> |
|
<th>HallusionBench</th> |
|
<th>TextVQA val</th> |
|
<th>DocVQA test</th> |
|
<th>MathVerse mini</th> |
|
<th>MathVision</th> |
|
<th>MMHal Score</th> |
|
</tr> |
|
</thead> |
|
<tbody align="center"> |
|
<tr> |
|
<td colspan="19" align="left"><strong>Proprietary</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GPT-4o-20240513</td> |
|
<td>-</td> |
|
<td>1088</td> |
|
<td><u>69.9</u></td> |
|
<td>736</td> |
|
<td>61.3</td> |
|
<td>85.7</td> |
|
<td><strong>69.1</strong></td> |
|
<td>63.9</td> |
|
<td>2328.7</td> |
|
<td>82.2</td> |
|
<td>84.6</td> |
|
<td><strong>69.2</strong></td> |
|
<td><strong>55.0</strong></td> |
|
<td>-</td> |
|
<td>92.8</td> |
|
<td><strong>50.2</strong></td> |
|
<td><strong>30.4</strong></td> |
|
<td><u>3.6</u></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Claude3.5-Sonnet</td> |
|
<td>-</td> |
|
<td>750</td> |
|
<td>67.9</td> |
|
<td>788</td> |
|
<td>61.6</td> |
|
<td><strong>90.8</strong></td> |
|
<td>66.0</td> |
|
<td>62.2</td> |
|
<td>1920.0</td> |
|
<td>78.5</td> |
|
<td>80.2</td> |
|
<td><u>65.9</u></td> |
|
<td>49.9</td> |
|
<td>-</td> |
|
<td><strong>95.2</strong></td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>3.4</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>64.4</td> |
|
<td>754</td> |
|
<td>57.7</td> |
|
<td>81.3</td> |
|
<td>64.0</td> |
|
<td>59.1</td> |
|
<td>2110.6</td> |
|
<td>73.9</td> |
|
<td>79.1</td> |
|
<td>60.6</td> |
|
<td>45.6</td> |
|
<td>73.5</td> |
|
<td>86.5</td> |
|
<td>-</td> |
|
<td>19.2</td> |
|
<td>-</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GPT-4o-mini-20240718</td> |
|
<td>-</td> |
|
<td>1088</td> |
|
<td>64.1</td> |
|
<td>785</td> |
|
<td>52.4</td> |
|
<td>-</td> |
|
<td>66.9</td> |
|
<td>54.8</td> |
|
<td>2003.4</td> |
|
<td>76.0</td> |
|
<td>77.8</td> |
|
<td>60.0</td> |
|
<td>46.1</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>3.3</td> |
|
</tr> |
|
<tr> |
|
<td colspan="19" align="left"><strong>Open Source</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Cambrian-34B</td> |
|
<td>34B</td> |
|
<td><u>1820</u></td> |
|
<td>58.3</td> |
|
<td>591</td> |
|
<td>50.3</td> |
|
<td>75.6</td> |
|
<td>53.2</td> |
|
<td>54.2</td> |
|
<td>2049.9</td> |
|
<td>77.8</td> |
|
<td>79.5</td> |
|
<td>50.4</td> |
|
<td>41.6</td> |
|
<td>76.7</td> |
|
<td>75.5</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GLM-4V-9B</td> |
|
<td>13B</td> |
|
<td>784</td> |
|
<td>59.1</td> |
|
<td>776</td> |
|
<td>51.1</td> |
|
<td>-</td> |
|
<td>58.0</td> |
|
<td>54.8</td> |
|
<td>2018.8</td> |
|
<td>67.9</td> |
|
<td>71.2</td> |
|
<td>46.9</td> |
|
<td>45.0</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Pixtral-12B</td> |
|
<td>12B</td> |
|
<td>256</td> |
|
<td>61.0</td> |
|
<td>685</td> |
|
<td>56.9</td> |
|
<td>81.8</td> |
|
<td>58.5</td> |
|
<td>54.5</td> |
|
<td>-</td> |
|
<td>72.7</td> |
|
<td>79.0</td> |
|
<td>51.1</td> |
|
<td>47.0</td> |
|
<td>75.7</td> |
|
<td>90.7</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">DeepSeek-VL2-27B (4B)</td> |
|
<td>27B</td> |
|
<td>672</td> |
|
<td>66.4</td> |
|
<td>809</td> |
|
<td>63.9</td> |
|
<td>86.0</td> |
|
<td>60.0</td> |
|
<td>61.9</td> |
|
<td>2253.0</td> |
|
<td>81.2</td> |
|
<td>83.8</td> |
|
<td>54.0</td> |
|
<td>45.3</td> |
|
<td><u>84.2</u></td> |
|
<td>93.3</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>3.0</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td> |
|
<td>8B</td> |
|
<td>784</td> |
|
<td>67.1</td> |
|
<td><u>866</u></td> |
|
<td>58.2</td> |
|
<td>83.0</td> |
|
<td>62.0</td> |
|
<td>60.7</td> |
|
<td>2326.0</td> |
|
<td>81.8</td> |
|
<td>83.0</td> |
|
<td>54.1</td> |
|
<td>50.6</td> |
|
<td><strong>84.3</strong></td> |
|
<td><u>94.5</u></td> |
|
<td>31.9</td> |
|
<td>16.3</td> |
|
<td>3.2</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">LLaVA-OneVision-72B</td> |
|
<td>72B</td> |
|
<td>182</td> |
|
<td>68.1</td> |
|
<td>741</td> |
|
<td>67.5</td> |
|
<td>83.7</td> |
|
<td>60.6</td> |
|
<td><strong>65.8</strong></td> |
|
<td>2261.0</td> |
|
<td><strong>85.0</strong></td> |
|
<td><u>85.6</u></td> |
|
<td>56.8</td> |
|
<td>49.0</td> |
|
<td>80.5</td> |
|
<td>91.3</td> |
|
<td>39.1</td> |
|
<td>-</td> |
|
<td>3.5</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">InternVL-2.5-8B</td> |
|
<td>8B</td> |
|
<td>706</td> |
|
<td>68.3</td> |
|
<td>822</td> |
|
<td><u>64.4</u></td> |
|
<td>84.8</td> |
|
<td>62.8</td> |
|
<td>62.8</td> |
|
<td>2344.0</td> |
|
<td><u>83.6</u></td> |
|
<td>84.5</td> |
|
<td>56.0</td> |
|
<td>50.1</td> |
|
<td>79.1</td> |
|
<td>93.0</td> |
|
<td>39.5</td> |
|
<td>19.7</td> |
|
<td>3.4</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td> |
|
<td>8B</td> |
|
<td><strong>2822</strong></td> |
|
<td>65.2</td> |
|
<td>852*</td> |
|
<td>60.6</td> |
|
<td>79.4</td> |
|
<td>60.0</td> |
|
<td>57.5</td> |
|
<td><u>2348.4*</u></td> |
|
<td>78.0</td> |
|
<td>82.1</td> |
|
<td>49.8*</td> |
|
<td>48.1*</td> |
|
<td>80.1</td> |
|
<td>90.8</td> |
|
<td>25.7</td> |
|
<td>18.3</td> |
|
<td>3.6</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> |
|
<td>8B</td> |
|
<td><strong>2822</strong></td> |
|
<td><strong>70.2</strong></td> |
|
<td><strong>897*</strong></td> |
|
<td><strong>71.9*</strong></td> |
|
<td><u>86.9*</u></td> |
|
<td><u>67.5</u></td> |
|
<td><u>64.0</u></td> |
|
<td><strong>2372.0*</strong></td> |
|
<td>80.5</td> |
|
<td><strong>85.8</strong></td> |
|
<td>50.4*</td> |
|
<td><u>51.9</u></td> |
|
<td>82.0</td> |
|
<td>93.5</td> |
|
<td><u>41.4*</u></td> |
|
<td><u>23.1*</u></td> |
|
<td><strong>3.8</strong></td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set. |
|
|
|
|
|
<sup>+</sup> Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens. |
|
|
|
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation. |
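
As a quick worked example of this definition, using the maximum input resolution (1344x1344 pixels) and the 640 visual tokens reported for MiniCPM-o 2.6 earlier in this card:

```python
# Token density = # pixels at maximum resolution / # visual tokens
max_pixels = 1344 * 1344                  # about 1.8M pixels
visual_tokens = 640
print(round(max_pixels / visual_tokens))  # 2822, matching the table above
```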
|
|
|
|
|
**Multi-image and Video Understanding** |
|
|
|
<div align="center"> |
|
|
|
<table style="margin: 0px auto;"> |
|
<thead> |
|
<tr> |
|
<th align="left">Model</th> |
|
<th>Size</th> |
|
<th>BLINK-val</th> |
|
<th>Mantis-Eval</th> |
|
<th>MIRB</th> |
|
<th>Video-MME (wo / w subs)</th> |
|
</tr> |
|
</thead> |
|
<tbody align="center"> |
|
<tr> |
|
<td colspan="6" align="left"><strong>Proprietary</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GPT-4o-20240513</td> |
|
<td>-</td> |
|
<td><strong>68</strong></td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td><strong>71.9/77.2</strong></td>
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GPT4V</td> |
|
<td>-</td> |
|
<td>54.6</td> |
|
<td>62.7</td> |
|
<td>53.1</td> |
|
<td>59.9/63.3</td> |
|
</tr> |
|
<tr> |
|
<td colspan="6" align="left"><strong>Open-source</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">LLaVA-NeXT-Interleave 14B</td> |
|
<td>14B</td> |
|
<td>52.6</td> |
|
<td>66.4</td> |
|
<td>30.2</td> |
|
<td>-</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">LLaVA-One-Vision-72B</td> |
|
<td>72B</td> |
|
<td>55.4</td> |
|
<td><strong>77.6</strong></td> |
|
<td>-</td> |
|
<td><u>66.2/69.5</u></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">MANTIS 8B</td> |
|
<td>8B</td> |
|
<td>49.1</td> |
|
<td>59.5</td> |
|
<td>34.8</td> |
|
<td>-</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td> |
|
<td>8B</td> |
|
<td>53.2</td> |
|
<td>69.6*</td> |
|
<td><strong>67.6*</strong></td> |
|
<td>63.3/69.0</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">InternVL-2.5-8B</td> |
|
<td>8B</td> |
|
<td>54.8</td> |
|
<td>67.7</td> |
|
<td>52.5</td> |
|
<td>64.2/66.9</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td> |
|
<td>8B</td> |
|
<td>53</td> |
|
<td>69.1</td> |
|
<td>53.8</td> |
|
<td>60.9/63.6</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> |
|
<td>8B</td> |
|
<td><u>56.7</u></td> |
|
<td><u>71.9</u></td> |
|
<td><u>58.6</u></td> |
|
<td>63.9/67.9</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
</div> |
|
* We evaluate officially released checkpoints by ourselves. |
|
|
|
</details> |
|
|
|
|
|
<details> |
|
<summary>Click to view audio understanding and speech conversation results.</summary> |
|
|
|
**Audio Understanding** |
|
|
|
<div align="center"> |
|
<table style="margin: 0px auto;"> |
|
<thead> |
|
<tr> |
|
<th align="left">Task</th> |
|
<th>Size</th> |
|
<th colspan="3">ASR (zh)</th> |
|
<th colspan="3">ASR (en)</th> |
|
<th colspan="2">ASR</th> |
|
<th>Emotion</th> |
|
</tr> |
|
<tr> |
|
<th align="left">Metric</th> |
|
<td></td> |
|
<th colspan="3">CER↓</th> |
|
<th colspan="3">WER↓</th> |
|
<th colspan="2">BLEU↑</th> |
|
<th>ACC↑</th> |
|
</tr> |
|
<tr> |
|
<th align="left">Dataset</th> |
|
<td></td> |
|
<th>AISHELL-1</th> |
|
<th>Fleurs zh</th> |
|
<th>WenetSpeech test-net</th> |
|
<th>LibriSpeech test-clean</th> |
|
<th>GigaSpeech</th> |
|
<th>TED-LIUM</th> |
|
<th>CoVoST en2zh</th> |
|
<th>CoVoST zh2en</th> |
|
<th>MELD emotion</th> |
|
</tr> |
|
</thead> |
|
<tbody align="center"> |
|
<tr> |
|
<td colspan="11" align="left"><strong>Proprietary</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td> |
|
<td>-</td> |
|
<td>7.3*</td> |
|
<td><u>5.4*</u></td> |
|
<td>28.9*</td> |
|
<td>2.6*</td> |
|
<td>12.9*</td> |
|
<td>4.8*</td> |
|
<td>37.1*</td> |
|
<td>15.7*</td> |
|
<td>33.2*</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td> |
|
<td>-</td> |
|
<td>4.5*</td> |
|
<td>5.9*</td> |
|
<td>14.3*</td> |
|
<td>2.9*</td> |
|
<td>10.6*</td> |
|
<td><strong>3.0*</strong></td> |
|
<td><u>47.3*</u></td> |
|
<td>22.6*</td> |
|
<td>48.4*</td> |
|
</tr> |
|
<tr> |
|
<td colspan="11" align="left"><strong>Open-Source</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Qwen2-Audio</td> |
|
<td>8B</td> |
|
<td>-</td> |
|
<td>7.5</td> |
|
<td>-</td> |
|
<td><strong>1.6</strong></td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>45.2</td> |
|
<td><u>24.4</u></td> |
|
<td><strong>55.3</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Qwen2-Audio-Instruction</td> |
|
<td>8B</td> |
|
<td>2.6*</td> |
|
<td>6.9*</td> |
|
<td><u>10.3*</u></td> |
|
<td>3.1*</td> |
|
<td><u>9.7</u>*</td> |
|
<td>5.9*</td> |
|
<td>39.5*</td> |
|
<td>22.9*</td> |
|
<td>17.4*</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GLM-4-Voice-Base</td> |
|
<td>9B</td> |
|
<td><u>2.5</u></td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>2.8</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td> |
|
<td>-</td>

<td>-</td>
|
</tr> |
|
<tr style="background-color: #e6f2ff;"> |
|
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> |
|
<td>8B</td> |
|
<td><strong>1.6</strong></td> |
|
<td><strong>4.4</strong></td> |
|
<td><strong>6.9</strong></td> |
|
<td><u>1.7</u></td> |
|
<td><strong>8.7</strong></td> |
|
<td><strong>3.0</strong></td> |
|
<td><strong>48.2</strong></td> |
|
<td><strong>27.2</strong></td> |
|
<td><u>52.4</u></td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
* We evaluate officially released checkpoints by ourselves.<br><br> |
|
|
|
**Speech Generation** |
|
|
|
<div align="center"> |
|
<table style="margin: 0px auto;"> |
|
<thead> |
|
<tr> |
|
<th align="left">Task</th> |
|
<th>Size</th> |
|
<th colspan="9">SpeechQA</th> |
|
</tr> |
|
<tr> |
|
<th align="left">Metric</th> |
|
<th></th> |
|
<th colspan="3">ACC↑</th> |
|
<th>G-Eval (10 point)↑</th> |
|
<th>Semantic ELO score↑</th> |
|
<th>Acoustic ELO score↑</th> |
|
<th>Overall ELO score↑</th> |
|
<th>UTMOS↑</th> |
|
<th>ASR-WER↓</th> |
|
</tr> |
|
<tr> |
|
<th align="left">Dataset</th> |
|
<th></th> |
|
<th>Speech Llama Q.</th> |
|
<th>Speech Web Q.</th> |
|
<th>Speech Trivia QA</th> |
|
<th>Speech AlpacaEval</th> |
|
<th colspan="5">AudioArena</th> |
|
</tr> |
|
</thead> |
|
<tbody align="center"> |
|
<tr> |
|
<td colspan="11" align="left"><strong>Proprietary</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GPT-4o-Realtime</td> |
|
<td></td> |
|
<td><strong>71.7</strong></td> |
|
<td><strong>51.6</strong></td> |
|
<td><strong>69.7</strong></td> |
|
<td><strong>7.4</strong></td> |
|
<td><strong>1157</strong></td> |
|
<td><strong>1203</strong></td> |
|
<td><strong>1200</strong></td> |
|
<td><strong>4.2</strong></td> |
|
<td><strong>2.3</strong></td> |
|
</tr> |
|
<tr> |
|
<td colspan="11" align="left"><strong>Open-Source</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GLM-4-Voice</td> |
|
<td>9B</td> |
|
<td>50.0</td> |
|
<td>32.0</td> |
|
<td>36.4</td> |
|
<td><u>5.1</u></td> |
|
<td>999</td> |
|
<td>1147</td> |
|
<td>1035</td> |
|
<td><u>4.1</u></td> |
|
<td><u>11.7</u></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Llama-Omni</td> |
|
<td>8B</td> |
|
<td>45.3</td> |
|
<td>22.9</td> |
|
<td>10.7</td> |
|
<td>3.9</td> |
|
<td>960</td> |
|
<td>878</td> |
|
<td>897</td> |
|
<td>3.2</td> |
|
<td>24.3</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Moshi</td> |
|
<td>7B</td> |
|
<td>43.7</td> |
|
<td>23.8</td> |
|
<td>16.7</td> |
|
<td>2.4</td> |
|
<td>871</td> |
|
<td>808</td> |
|
<td>875</td> |
|
<td>2.8</td> |
|
<td>8.2</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Mini-Omni</td> |
|
<td>1B</td> |
|
<td>22.0</td> |
|
<td>12.8</td> |
|
<td>6.9</td> |
|
<td>2.5</td> |
|
<td>926</td> |
|
<td>803</td> |
|
<td>865</td> |
|
<td>3.4</td> |
|
<td>10.0</td> |
|
</tr> |
|
<tr style="background-color: #e6f2ff;"> |
|
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> |
|
<td>8B</td> |
|
<td><u>61.0</u></td> |
|
<td><u>40.0</u></td> |
|
<td><u>40.2</u></td> |
|
<td><u>5.1</u></td> |
|
<td><u>1088</u></td> |
|
<td><u>1163</u></td> |
|
<td><u>1131</u></td> |
|
<td><strong>4.2</strong></td> |
|
<td>9.8</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
All results are from AudioEvals; the evaluation methods and further details can be found in <a href="https://github.com/OpenBMB/UltraEval-Audio" target="_blank">AudioEvals</a>.<br><br>
|
|
|
**Voice Cloning** |
|
|
|
<div align="center"> |
|
<table style="margin: 0px auto;"> |
|
<thead> |
|
<tr> |
|
<th align="left">Task</th> |
|
<th colspan="2">Voice cloning</th> |
|
</tr> |
|
<tr> |
|
<th align="left">Metric</th> |
|
<th>SIMO↑</th> |
|
<th>SIMO↑</th> |
|
</tr> |
|
<tr> |
|
<th align="left">Dataset</th> |
|
<th>Seed-TTS test-zh</th> |
|
<th>Seed-TTS test-en</th> |
|
</tr> |
|
</thead> |
|
<tbody align="center"> |
|
<tr> |
|
<td nowrap="nowrap" align="left">F5-TTS</td> |
|
<td><strong>76</strong></td> |
|
<td><strong>67</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">CosyVoice</td> |
|
<td><u>75</u></td> |
|
<td><u>64</u></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">FireRedTTS</td> |
|
<td>63</td> |
|
<td>46</td> |
|
</tr> |
|
<tr style="background-color: #e6f2ff;"> |
|
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> |
|
<td>57</td> |
|
<td>47</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</div> |
|
Note: In the Mimick task, the model takes audio input and outputs both an ASR transcription and a voice imitation (TTS).
|
|
|
</details> |
|
|
|
<details> |
|
<summary>Click to view multimodal live streaming results.</summary> |
|
|
|
**Multimodal Live Streaming**: results on StreamingBench |
|
|
|
<table style="margin: 0px auto;"> |
|
<thead> |
|
<tr> |
|
<th align="left">Model</th> |
|
<th>Size</th> |
|
<th>Real-Time Video Understanding</th> |
|
<th>Omni-Source Understanding</th> |
|
<th>Contextual Understanding</th> |
|
<th>Overall</th> |
|
</tr> |
|
</thead> |
|
<tbody align="center"> |
|
<tr> |
|
<td colspan="7" align="left"><strong>Proprietary</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Gemini 1.5 Pro</td> |
|
<td>-</td> |
|
<td><u>77.4</u></td> |
|
<td><strong>67.8</strong></td> |
|
<td><strong>51.1</strong></td> |
|
<td><strong>70.3</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">GPT-4o</td> |
|
<td>-</td> |
|
<td>74.5</td> |
|
<td>51.0</td> |
|
<td><u>48.0</u></td> |
|
<td>64.1</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Claude-3.5-Sonnet</td> |
|
<td>-</td> |
|
<td>74.0</td> |
|
<td>41.4</td> |
|
<td>37.8</td> |
|
<td>59.7</td> |
|
</tr> |
|
<tr> |
|
<td colspan="9" align="left"><strong>Open-source</strong></td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">VILA-1.5</td> |
|
<td>8B</td> |
|
<td>61.5</td> |
|
<td>37.5</td> |
|
<td>26.7</td> |
|
<td>49.5</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">LongVA</td> |
|
<td>7B</td> |
|
<td>63.1</td> |
|
<td>35.9</td> |
|
<td>30.2</td> |
|
<td>50.7</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">LLaVA-Next-Video-34B</td> |
|
<td>34B</td> |
|
<td>69.8</td> |
|
<td>41.7</td> |
|
<td>34.3</td> |
|
<td>56.7</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">Qwen2-VL-7B</td> |
|
<td>8B</td> |
|
<td>71.2</td> |
|
<td>40.7</td> |
|
<td>33.1</td> |
|
<td>57.0</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">InternVL2-8B</td> |
|
<td>8B</td> |
|
<td>70.1</td> |
|
<td>42.7</td> |
|
<td>34.1</td> |
|
<td>57.0</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">VITA-1.5</td> |
|
<td>8B</td> |
|
<td>70.9</td> |
|
<td>40.8</td> |
|
<td>35.8</td> |
|
<td>57.4</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">LLaVA-OneVision-7B</td> |
|
<td>8B</td> |
|
<td>74.3</td> |
|
<td>40.8</td> |
|
<td>31.0</td> |
|
<td>58.4</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">InternLM-XC2.5-OL-7B</td> |
|
<td>8B</td> |
|
<td>75.4</td> |
|
<td>46.2</td> |
|
<td>33.6</td> |
|
<td>60.8</td> |
|
</tr> |
|
<tr> |
|
<td nowrap="nowrap" align="left">MiniCPM-V 2.6</td> |
|
<td>8B</td> |
|
<td>72.4</td> |
|
<td>40.2</td> |
|
<td>33.4</td> |
|
<td>57.7</td> |
|
</tr> |
|
<tr style="background-color: #e6f2ff;"> |
|
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td> |
|
<td>8B</td> |
|
<td><strong>79.9</strong></td> |
|
<td><u>53.4</u></td> |
|
<td>38.5</td> |
|
<td><u>66.0</u></td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
</details> |
|
|
|
|
|
### Examples <!-- omit in toc --> |
|
|
|
We deploy MiniCPM-o 2.6 on end-side devices. The demo video is a raw screen recording on an iPad Pro without editing.
|
|
|
|
|
<div style="display: flex; flex-direction: column; align-items: center;"> |
|
<img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;"> |
|
<img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;"> |
|
<img src="https://github.com/yiranyyu/MiniCPM-V-private/blob/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;"> |
|
</div> |
|
|
|
|
|
|
|
|
|
## Online Demo |
|
Click here to try the online demo of **MiniCPM-o 2.6** on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn) server. |
|
|
|
|
|
## Usage |
|
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.10:
|
``` |
|
Pillow==10.1.0 |
|
torch==2.2.0 |
|
torchaudio==2.2.0 |
|
torchvision==0.17.0 |
|
transformers==4.44.2 |
|
librosa==0.9.0 |
|
soundfile==0.12.1 |
|
vector-quantize-pytorch==1.18.5 |
|
vocos==0.1.0 |
|
decord |
|
moviepy |
|
``` |
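
These requirements can be saved to a `requirements.txt` file and installed with `pip install -r requirements.txt`.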
|
|
|
|
|
### Model initialization |
|
```python |
|
|
|
import torch |
|
from PIL import Image |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
# load the omni model by default; init_vision/init_audio/init_tts default to True

# to load a vision-only model, set init_audio=False and init_tts=False

# to load an audio-only model, set init_vision=False
|
model = AutoModel.from_pretrained( |
|
'openbmb/MiniCPM-o-2_6', |
|
trust_remote_code=True, |
|
attn_implementation='sdpa', # sdpa or flash_attention_2 |
|
torch_dtype=torch.bfloat16, |
|
init_vision=True, |
|
init_audio=True, |
|
init_tts=True |
|
) |
|
|
|
|
|
model = model.eval().cuda() |
|
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True) |
|
|
|
# except for vision-only mode, the TTS processor and vocos also need to be initialized
|
model.init_tts() |
|
model.tts.float() |
|
``` |
|
### Omni mode |
|
We provide two inference modes: chat and streaming.
|
|
|
#### chat inference |
|
```python |
|
import math |
|
import numpy as np |
|
from PIL import Image |
|
from moviepy.editor import VideoFileClip |
|
import tempfile |
|
import librosa |
|
import soundfile as sf |
|
|
|
def get_video_chunk_content(video_path, flatten=True): |
|
video = VideoFileClip(video_path) |
|
print('video_duration:', video.duration) |
|
|
|
with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file: |
|
temp_audio_file_path = temp_audio_file.name |
|
video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000) |
|
audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True) |
|
num_units = math.ceil(video.duration) |
|
|
|
# 1 frame + 1s audio chunk |
|
contents= [] |
|
for i in range(num_units): |
|
frame = video.get_frame(i+1) |
|
image = Image.fromarray((frame).astype(np.uint8)) |
|
audio = audio_np[sr*i:sr*(i+1)] |
|
if flatten: |
|
contents.extend(["<unit>", image, audio]) |
|
else: |
|
contents.append(["<unit>", image, audio]) |
|
|
|
return contents |
|
|
|
video_path="/path/to/video" |
|
sys_msg = model.get_sys_prompt(mode='omni', language='en') |
|
# if use voice clone prompt, please set ref_audio |
|
# ref_audio_path = '/path/to/ref_audio' |
|
# ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True) |
|
# sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en') |
|
|
|
contents = get_video_chunk_content(video_path) |
|
msg = {"role":"user", "content": contents} |
|
msgs = [sys_msg, msg] |
|
|
|
# please set generate_audio=True and output_audio_path to save the tts result |
|
generate_audio = True |
|
output_audio_path = 'output.wav' |
|
|
|
res = model.chat( |
|
msgs=msgs, |
|
tokenizer=tokenizer, |
|
sampling=True, |
|
temperature=0.5, |
|
max_new_tokens=4096, |
|
omni_input=True, # please set omni_input=True when omni inference |
|
use_tts_template=True, |
|
generate_audio=generate_audio, |
|
output_audio_path=output_audio_path, |
|
max_slice_nums=1, |
|
use_image_id=False, |
|
return_dict=True |
|
) |
|
print(res) |
|
``` |
|
#### streaming inference |
|
```python |
|
# a new conversation needs to reset the session first; this clears the KV cache
|
model.reset_session() |
|
|
|
contents = get_video_chunk_content(video_path, flatten=False) |
|
session_id = '123' |
|
generate_audio = True |
|
|
|
# 1. prefill system prompt |
|
res = model.streaming_prefill( |
|
session_id=session_id, |
|
msgs=[sys_msg], |
|
tokenizer=tokenizer |
|
) |
|
|
|
# 2. prefill video/audio chunks |
|
for content in contents: |
|
msgs = [{"role":"user", "content": content}] |
|
res = model.streaming_prefill( |
|
session_id=session_id, |
|
msgs=msgs, |
|
tokenizer=tokenizer |
|
) |
|
|
|
# 3. generate |
|
res = model.streaming_generate( |
|
session_id=session_id, |
|
tokenizer=tokenizer, |
|
temperature=0.5, |
|
generate_audio=generate_audio |
|
) |
|
|
|
audios = [] |
|
text = "" |
|
|
|
if generate_audio: |
|
for r in res: |
|
audio_wav = r.audio_wav |
|
sampling_rate = r.sampling_rate |
|
txt = r.text |
|
|
|
audios.append(audio_wav) |
|
text += txt |
|
|
|
res = np.concatenate(audios) |
|
sf.write("output.wav", res, samplerate=sampling_rate) |
|
print("text:", text) |
|
print("audio saved to output.wav") |
|
else: |
|
for r in res: |
|
text += r['text'] |
|
print("text:", text) |
|
|
|
``` |
|
|
|
### Audio-Only mode |
|
#### Mimick |
|
```python |
|
mimick_prompt = "Please repeat each user's speech, including voice style and speech content." |
|
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True) |
|
msgs = [{'role': 'user', 'content': [mimick_prompt,audio_input]}] |
|
|
|
res = model.chat( |
|
msgs=msgs, |
|
tokenizer=tokenizer, |
|
sampling=True, |
|
max_new_tokens=128, |
|
use_tts_template=True, |
|
temperature=0.3, |
|
generate_audio=True, |
|
output_audio_path='output.wav', # save the tts result to output_audio_path |
|
) |
|
``` |
|
|
|
#### General Speech Conversation with Configurable Voices |
|
<details> <summary>Click to view the Python code for enabling MiniCPM-o 2.6 to interact with you in a specified voice.</summary> |
|
|
|
```python |
|
ref_audio, _ = librosa.load('./assert/voice_01.wav', sr=16000, mono=True) # load the reference audio |
|
|
|
# Audio RolePlay: with this mode, the model will role-play the character based on the audio prompt.
|
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en') |
|
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} |
|
|
|
# Audio Assistant: with this mode, the model will speak with the voice in ref_audio as an AI assistant.
|
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') |
|
# user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # Try to ask something! |
|
``` |
|
```python |
|
msgs = [sys_prompt, user_question] |
|
res = model.chat( |
|
image=None, |
|
msgs=msgs, |
|
context=None, |
|
tokenizer=tokenizer, |
|
sampling=True, |
|
max_new_tokens=128, |
|
stream=False, |
|
stream_input=True, |
|
use_tts_template=True, |
|
generate_audio=True, |
|
temperature=0.3, |
|
output_audio_path='result.wav', |
|
) |
|
|
|
# round two |
|
msgs.append({'role': 'assistant', 'content': res})  # list.append returns None, so extend msgs in place

user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}

msgs.append(user_question)
|
res = model.chat( |
|
image=None, |
|
msgs=msgs, |
|
context=None, |
|
tokenizer=tokenizer, |
|
sampling=True, |
|
max_new_tokens=128, |
|
stream=False, |
|
stream_input=True, |
|
use_tts_template=True, |
|
generate_audio=True, |
|
temperature=0.3, |
|
output_audio_path='result_round_2.wav', |
|
) |
|
print(res) |
|
``` |
|
|
|
</details> |
|
|
|
#### Addressing various audio tasks |
|
<details> |
|
<summary> Click to show Python code running MiniCPM-o 2.6 with specific audioQA task. </summary> |
|
|
|
```python |
|
''' |
|
Audio Understanding Task Prompt: |
|
Speech: |
|
ASR with ZH(same as AST en2zh): 请仔细听这段音频片段,并将其内容逐字记录。 |
|
ASR with EN(same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content. |
|
Speaker Analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status. |
|
General Audio: |
|
Audio Caption: Summarize the main content of the audio. |
|
Sound Scene Tagging: Utilize one keyword to convey the audio's content or the associated scene. |
|
''' |
|
task_prompt = "\n" |
|
audio_input, _ = librosa.load('xxx.wav', sr=16000, mono=True) |
|
|
|
msgs = [{'role': 'user', 'content': [task_prompt,audio_input]}] |
|
|
|
res = model.chat( |
|
image=None, |
|
msgs=msgs, |
|
context=None, |
|
tokenizer=tokenizer, |
|
sampling=True, |
|
max_new_tokens=128, |
|
stream=False, |
|
stream_input=True, |
|
use_tts_template=True, |
|
generate_audio=True, |
|
temperature=0.3, |
|
output_audio_path='result.wav', |
|
) |
|
print(res) |
|
``` |
|
```python |
|
''' |
|
Speech Generation Task Prompt: |
|
Human Instruction-to-Speech: see https://voxinstruct.github.io/VoxInstruct/ |
|
Example: |
|
# 在新闻中,一个年轻男性兴致勃勃地说:“祝福亲爱的祖国母亲美丽富强!”他用低音调和低音量,慢慢地说出了这句话。 |
|
# Delighting in a surprised tone, an adult male with low pitch and low volume comments:"One even gave my little dog a biscuit" This dialogue takes place at a leisurely pace, delivering a sense of excitement and surprise in the context. |
|
|
|
Voice Cloning or Voice Creation: With this mode, model will act like a TTS model. |
|
''' |
|
# Human Instruction-to-Speech: |
|
task_prompt = '' # write a Human Instruction-to-Speech prompt here (see the examples above)
|
msgs = [{'role': 'user', 'content': [task_prompt]}] # you can try to use the same audio question |
|
|
|
# Voice Cloning mode: With this mode, model will act like a TTS model. |
|
# sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en') |
|
# text_prompt = f"Please read the text below." |
|
# user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]} # using same voice in sys_prompt to read the text. (Voice Cloning) |
|
# user_question = {'role': 'user', 'content': [text_prompt, librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # using same voice in sys_prompt to read 'xxx.wav'. (Voice Creation) |
|
|
|
# msgs = [sys_prompt, user_question] # use this instead when running the Voice Cloning / Voice Creation mode above
|
res = model.chat( |
|
image=None, |
|
msgs=msgs, |
|
context=None, |
|
tokenizer=tokenizer, |
|
sampling=True, |
|
max_new_tokens=128, |
|
stream=False, |
|
stream_input=True, |
|
use_tts_template=True, |
|
generate_audio=True, |
|
temperature=0.3, |
|
output_audio_path='result.wav', |
|
) |
|
|
|
|
|
``` |
|
|
|
</details> |
|
|
|
### Vision-Only mode |
|
|
|
`MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.
|
|
|
#### chat with single image |
|
```python |
|
# test.py |
|
image = Image.open('xx.jpg').convert('RGB') |
|
question = 'What is in the image?' |
|
msgs = [{'role': 'user', 'content': [image, question]}] |
|
res = model.chat( |
|
image=None, |
|
msgs=msgs, |
|
tokenizer=tokenizer |
|
) |
|
print(res) |
|
|
|
## if you want to use streaming, please make sure sampling=True and stream=True |
|
## the model.chat will return a generator |
|
res = model.chat( |
|
msgs=msgs, |
|
tokenizer=tokenizer, |
|
sampling=True, |
|
stream=True |
|
) |
|
generated_text = "" |
|
for new_text in res: |
|
generated_text += new_text |
|
print(new_text, flush=True, end='') |
|
``` |
|
|
|
#### Chat with multiple images |
|
<details> |
|
<summary> Click to show Python code running MiniCPM-o 2.6 with multiple images input. </summary> |
|
|
|
```python |
|
image1 = Image.open('image1.jpg').convert('RGB') |
|
image2 = Image.open('image2.jpg').convert('RGB') |
|
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.' |
|
msgs = [{'role': 'user', 'content': [image1, image2, question]}] |
|
answer = model.chat( |
|
msgs=msgs, |
|
tokenizer=tokenizer |
|
) |
|
print(answer) |
|
``` |
|
</details> |
|
|
|
#### In-context few-shot learning |
|
<details> |
|
<summary> Click to view Python code running MiniCPM-o 2.6 with few-shot input. </summary> |
|
|
|
```python |
|
question = "production date" |
|
image1 = Image.open('example1.jpg').convert('RGB') |
|
answer1 = "2023.08.04" |
|
image2 = Image.open('example2.jpg').convert('RGB') |
|
answer2 = "2007.04.24" |
|
image_test = Image.open('test.jpg').convert('RGB') |
|
msgs = [ |
|
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]}, |
|
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]}, |
|
{'role': 'user', 'content': [image_test, question]} |
|
] |
|
answer = model.chat( |
|
msgs=msgs, |
|
tokenizer=tokenizer |
|
) |
|
print(answer) |
|
``` |
|
</details> |
|
|
|
#### Chat with video |
|
<details> |
|
<summary> Click to view Python code running MiniCPM-o 2.6 with video input. </summary> |
|
|
|
```python |
|
from decord import VideoReader, cpu  # decord is listed in the requirements above

MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
|
def encode_video(video_path): |
|
def uniform_sample(l, n): |
|
gap = len(l) / n |
|
idxs = [int(i * gap + gap / 2) for i in range(n)] |
|
return [l[i] for i in idxs] |
|
vr = VideoReader(video_path, ctx=cpu(0)) |
|
sample_fps = round(vr.get_avg_fps() / 1) # FPS |
|
frame_idx = [i for i in range(0, len(vr), sample_fps)] |
|
if len(frame_idx) > MAX_NUM_FRAMES: |
|
frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES) |
|
frames = vr.get_batch(frame_idx).asnumpy() |
|
frames = [Image.fromarray(v.astype('uint8')) for v in frames] |
|
print('num frames:', len(frames)) |
|
return frames |
|
video_path ="video_test.mp4" |
|
frames = encode_video(video_path) |
|
question = "Describe the video" |
|
msgs = [ |
|
{'role': 'user', 'content': frames + [question]}, |
|
] |
|
# Set decode params for video |
|
params={} |
|
params["use_image_id"] = False |
|
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448 |
|
answer = model.chat( |
|
msgs=msgs, |
|
tokenizer=tokenizer, |
|
**params |
|
) |
|
print(answer) |
|
``` |
|
</details> |
|
|
|
Please look at [GitHub](https://github.com/OpenBMB/MiniCPM-V) for more details about usage.
|
|
|
|
|
## Inference with llama.cpp<a id="llamacpp"></a> |
|
MiniCPM-o 2.6 can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more detail. |
|
|
|
|
|
## Int4 quantized version |
|
Download the int4 quantized version for lower GPU memory (7GB) usage: [MiniCPM-o-2_6-int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4). |
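
A minimal loading sketch for the int4 checkpoint, assuming it exposes the same `AutoModel`/`AutoTokenizer` interface as the full-precision model in the Usage section above:

```python
from transformers import AutoModel, AutoTokenizer

# Assumption: the int4 checkpoint loads through the same trust_remote_code path as the
# bf16 model above; quantization is baked into the checkpoint, so no torch_dtype is set.
model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6-int4', trust_remote_code=True)
```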
|
|
|
|
|
## License |
|
#### Model License |
|
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License. |
|
* The usage of MiniCPM-o and MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md). |
|
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-o 2.6 weights are also available for free commercial use. |
|
|
|
|
|
#### Statement |
|
* As an LMM, MiniCPM-o 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-o 2.6 does not represent the views and positions of the model developers.
|
* We will not be liable for any problems arising from the use of the MiniCPM-o and MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
|
|
|
## Key Techniques and Other Multimodal Projects |
|
|
|
👏 Welcome to explore key techniques of MiniCPM-o 2.6 and other multimodal projects of our team: |
|
|
|
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V) |
|
|
|
## Citation |
|
|
|
If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️! |
|
|
|
```bib |
|
@article{yao2024minicpm, |
|
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone}, |
|
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others}, |
|
journal={arXiv preprint arXiv:2408.01800}, |
|
year={2024} |
|
} |
|
``` |
|
|