update readme
README.md
CHANGED
@@ -29,14 +29,14 @@ tags:

 ## MiniCPM-o 2.6

-**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for

 - 🔥 **Leading Visual Capability.**
 MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.

-- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual

-- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support

 - 💪 **Strong OCR Capability and Others.**
 Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
@@ -47,7 +47,7 @@ Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can pr
 In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as an iPad.

 - 💫 **Easy Usage.**
-MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [


 **Model Architecture.**
@@ -60,6 +60,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github
 <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" width="80%">
 </div>

 ### Evaluation <!-- omit in toc -->

 <div align="center">
@@ -562,7 +563,7 @@ Note: For proprietary models, we calculate token density based on the image enco
 <td colspan="11" align="left"><strong>Open-Source</strong></td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">Qwen2-Audio</td>
 <td>8B</td>
 <td>-</td>
 <td>7.5</td>
@@ -814,7 +815,7 @@ All results are from AudioEvals, and the evaluation methods along with further d
 <td><strong>70.3</strong></td>
 </tr>
 <tr>
-<td nowrap="nowrap" align="left">GPT-4o</td>
 <td>-</td>
 <td>74.5</td>
 <td>51.0</td>
@@ -920,7 +921,13 @@ All results are from AudioEvals, and the evaluation methods along with further d

 ### Examples <!-- omit in toc -->

-We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw


 <div style="display: flex; flex-direction: column; align-items: center;">
@@ -29,14 +29,14 @@ tags:

 ## MiniCPM-o 2.6

+**MiniCPM-o 2.6** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:

 - 🔥 **Leading Visual Capability.**
 MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet** for single image understanding. It also **outperforms GPT-4V and Claude 3.5 Sonnet** in multi-image and video understanding, and shows promising in-context learning capability.

+- 🎙 **State-of-the-art Speech Capability.** MiniCPM-o 2.6 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It **outperforms GPT-4o-realtime on audio understanding tasks** such as ASR and STT translation, and shows **state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community**. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, role play, etc.

+- 🎬 **Strong Multimodal Live Streaming Capability.** As a new feature, MiniCPM-o 2.6 can **accept continuous video and audio streams independent of user queries, and support real-time speech interaction**. It **outperforms GPT-4o-202408 and Claude 3.5 Sonnet and shows state-of-the-art performance in the open-source community on StreamingBench**, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.

 - 💪 **Strong OCR Capability and Others.**
 Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves **state-of-the-art performance on OCRBench for models under 25B, surpassing proprietary models such as GPT-4o-202405**.
@@ -47,7 +47,7 @@ Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can pr
 In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as an iPad.

 - 💫 **Easy Usage.**
+MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [server](https://minicpm-omni-webdemo-us.modelbest.cn/).


 **Model Architecture.**
@@ -60,6 +60,7 @@ MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github
 <img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/minicpm-o-26-framework.png" width="80%">
 </div>

+
 ### Evaluation <!-- omit in toc -->

 <div align="center">
@@ -562,7 +563,7 @@ Note: For proprietary models, we calculate token density based on the image enco
 <td colspan="11" align="left"><strong>Open-Source</strong></td>
 </tr>
 <tr>
+<td nowrap="nowrap" align="left">Qwen2-Audio-Base</td>
 <td>8B</td>
 <td>-</td>
 <td>7.5</td>
@@ -814,7 +815,7 @@ All results are from AudioEvals, and the evaluation methods along with further d
 <td><strong>70.3</strong></td>
 </tr>
 <tr>
+<td nowrap="nowrap" align="left">GPT-4o-202408</td>
 <td>-</td>
 <td>74.5</td>
 <td>51.0</td>
@@ -920,7 +921,13 @@ All results are from AudioEvals, and the evaluation methods along with further d

 ### Examples <!-- omit in toc -->

+We deploy MiniCPM-o 2.6 on end devices. The demo video is a raw-speed recording on an iPad Pro and a web demo.
+
+<div align="center">
+<a href="https://youtu.be/JFJg9KZ_iZk"><img src="https://github.com/OpenBMB/MiniCPM-o/raw/main/assets/o-2dot6-demo-video-preview.png" width="70%"></a>
+</div>
+
+<br>


 <div style="display: flex; flex-direction: column; align-items: center;">
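
As a quick sanity check on the token-density claim in the README text above (640 visual tokens for a 1.8M-pixel image, "75% fewer than most models"), the arithmetic can be sketched as follows. This is only an illustration: the helper name `token_density` is hypothetical, the 1344x1344 resolution is the example given in the README, and the baseline token count is merely what the "75% fewer" wording implies, not a measured figure for any specific model.

```python
def token_density(width: int, height: int, num_tokens: int) -> float:
    """Pixels encoded per visual token (hypothetical helper for illustration)."""
    return (width * height) / num_tokens


# Example resolution quoted in the README: 1344x1344, about 1.8M pixels.
pixels = 1344 * 1344

# 640 visual tokens for that image, as stated in the README.
density = token_density(1344, 1344, 640)

# If 640 tokens is "75% fewer" than a baseline, the implied baseline is 640 / 0.25.
# This is derived from the claim only; it is an assumption, not a benchmark result.
implied_baseline_tokens = 640 / (1 - 0.75)

print(f"pixels: {pixels}")
print(f"token density: {density:.1f} pixels/token")
print(f"implied baseline tokens: {implied_baseline_tokens:.0f}")
```

Under these assumptions the image has 1,806,336 pixels, giving roughly 2,822 pixels per token, and the "75% fewer" wording corresponds to a baseline of about 2,560 tokens for the same image.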