---
pipeline_tag: image-text-to-text
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-v
- vision
- ocr
- multi-image
- video
- custom_code
---
A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
[GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Demo](http://211.93.21.133:8889/)
## MiniCPM-V 4.0
**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. The model is built on SigLIP2-400M and MiniCPM4-3B, with 4.1B parameters in total. It inherits the strong single-image, multi-image and video understanding performance of MiniCPM-V 2.6 with substantially improved efficiency. Notable features of MiniCPM-V 4.0 include:
- 🔥 **Leading Visual Capability.**
  With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414 (OpenCompass 68.9), MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2) and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also performs well in multi-image and video understanding.
- 🚀 **Superior Efficiency.**
  Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it delivers a **first-token latency under 2 s and a decoding speed above 17 tokens/s on iPhone 16 Pro Max**, without overheating. It also shows superior throughput under concurrent requests.
- 💫 **Easy Usage.**
  MiniCPM-V 4.0 can be easily used in various ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory and a local web demo** (see the vLLM sketch below). We also open-source an iOS app that runs on iPhone and iPad. Get started easily with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), featuring detailed instructions and practical examples.
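As an illustration, here is a minimal offline-inference sketch with vLLM. It follows the MiniCPM-V recipe from the vLLM examples; MiniCPM-V 4.0 support and the exact `(<image>./</image>)` placeholder format depend on your vLLM version, so treat this as a sketch and prefer the maintained Cookbook recipes.

```python
# Sketch: offline inference with vLLM, following the MiniCPM-V recipe in the
# vLLM examples. Whether a given vLLM build supports MiniCPM-V 4.0, and the
# '(<image>./</image>)' placeholder format, may vary by version.
from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = 'openbmb/MiniCPM-V-4'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
llm = LLM(model=model_path, trust_remote_code=True, max_model_len=4096)

image = Image.open('./assets/single.png').convert('RGB')

# Build the prompt with the model's chat template; the image placeholder
# precedes the question, as in the vLLM MiniCPM-V examples.
messages = [{'role': 'user',
             'content': '(<image>./</image>)\nWhat is the landform in the picture?'}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)

outputs = llm.generate(
    {'prompt': prompt, 'multi_modal_data': {'image': image}},
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```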
### Evaluation
**Single image results on OpenCompass, OCRBench, MathVista, HallusionBench, MMMU, MMVet, MMBench V1.1, MMStar and AI2D:**

| Model | Size | OpenCompass | OCRBench | MathVista | HallusionBench | MMMU | MMVet | MMBench V1.1 | MMStar | AI2D |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| **Proprietary** | | | | | | | | | | |
| GPT-4v-20240409 | - | 63.5 | 656 | 55.2 | 43.9 | 61.7 | 67.5 | 79.8 | 56.0 | 78.6 |
| Gemini-1.5-Pro | - | 64.5 | 754 | 58.3 | 45.6 | 60.6 | 64.0 | 73.9 | 59.1 | 79.1 |
| GPT-4.1-mini-20250414 | - | 68.9 | 840 | 70.9 | 49.3 | 55.0 | 74.3 | 80.9 | 60.9 | 76.0 |
| Claude 3.5 Sonnet-20241022 | - | 70.6 | 798 | 65.3 | 55.5 | 66.4 | 70.1 | 81.7 | 65.1 | 81.2 |
| **Open-source** | | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | 64.5 | 828 | 61.2 | 46.6 | 51.2 | 60.0 | 76.8 | 56.3 | 81.4 |
| InternVL2.5-4B | 3.7B | 65.1 | 820 | 60.8 | 46.6 | 51.8 | 61.5 | 78.2 | 58.7 | 81.4 |
| Qwen2.5-VL-7B-Instruct | 8.3B | 70.9 | 888 | 68.1 | 51.9 | 58.0 | 69.7 | 82.2 | 64.1 | 84.3 |
| InternVL2.5-8B | 8.1B | 68.1 | 821 | 64.5 | 49.0 | 56.2 | 62.8 | 82.5 | 63.2 | 84.6 |
| MiniCPM-V-2.6 | 8.1B | 65.2 | 852 | 60.8 | 48.1 | 49.8 | 60.0 | 78.0 | 57.5 | 82.1 |
| MiniCPM-o-2.6 | 8.7B | 70.2 | 889 | 73.3 | 51.1 | 50.9 | 67.2 | 80.6 | 63.3 | 86.1 |
| MiniCPM-V-4.0 | 4.1B | 69.0 | 894 | 66.9 | 50.8 | 51.2 | 68.0 | 79.7 | 62.8 | 82.9 |
**Single image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench and MM HalBench:**

| Model | Size | ChartQA | MME | RealWorldQA | TextVQA | DocVQA | MathVision | DynaMath | WeMath | Obj Hal CHAIRs↓ | Obj Hal CHAIRi↓ | MM Hal score avg@3↑ | MM Hal hall rate avg@3↓ |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| **Proprietary** | | | | | | | | | | | | | |
| GPT-4v-20240409 | - | 78.5 | 1927 | 61.4 | 78.0 | 88.4 | - | - | - | - | - | - | - |
| Gemini-1.5-Pro | - | 87.2 | - | 67.5 | 78.8 | 93.1 | 41.0 | 31.5 | 50.5 | - | - | - | - |
| GPT-4.1-mini-20250414 | - | - | - | - | - | - | 45.3 | 47.7 | - | - | - | - | - |
| Claude 3.5 Sonnet-20241022 | - | 90.8 | - | 60.1 | 74.1 | 95.2 | 35.6 | 35.7 | 44.0 | - | - | - | - |
| **Open-source** | | | | | | | | | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | 84.0 | 2157 | 65.4 | 79.3 | 93.9 | 21.9 | 13.2 | 22.9 | 18.3 | 10.8 | 3.9 | 33.3 |
| InternVL2.5-4B | 3.7B | 84.0 | 2338 | 64.3 | 76.8 | 91.6 | 18.4 | 15.2 | 21.2 | 13.7 | 8.7 | 3.2 | 46.5 |
| Qwen2.5-VL-7B-Instruct | 8.3B | 87.3 | 2347 | 68.5 | 84.9 | 95.7 | 25.4 | 21.8 | 36.2 | 13.3 | 7.9 | 4.1 | 31.6 |
| InternVL2.5-8B | 8.1B | 84.8 | 2344 | 70.1 | 79.1 | 93.0 | 17.0 | 9.4 | 23.5 | 18.3 | 11.6 | 3.6 | 37.2 |
| MiniCPM-V-2.6 | 8.1B | 79.4 | 2348 | 65.0 | 80.1 | 90.8 | 17.5 | 9.0 | 20.4 | 7.3 | 4.7 | 4.0 | 29.9 |
| MiniCPM-o-2.6 | 8.7B | 86.9 | 2372 | 68.1 | 82.0 | 93.5 | 21.7 | 10.4 | 25.2 | 6.3 | 3.4 | 4.1 | 31.3 |
| MiniCPM-V-4.0 | 4.1B | 84.4 | 2298 | 68.5 | 80.8 | 92.9 | 20.7 | 14.2 | 32.7 | 6.3 | 3.5 | 4.1 | 29.2 |
**Multi-image and video understanding results on Mantis, Blink and Video-MME:**

| Model | Size | Mantis | Blink | Video-MME (w/o subs) | Video-MME (w/ subs) |
|:--|:--|:--|:--|:--|:--|
| **Proprietary** | | | | | |
| GPT-4v-20240409 | - | 62.7 | 54.6 | 59.9 | 63.3 |
| Gemini-1.5-Pro | - | - | 59.1 | 75.0 | 81.3 |
| GPT-4o-20240513 | - | - | 68.0 | 71.9 | 77.2 |
| **Open-source** | | | | | |
| Qwen2.5-VL-3B-Instruct | 3.8B | - | 47.6 | 61.5 | 67.6 |
| InternVL2.5-4B | 3.7B | 62.7 | 50.8 | 62.3 | 63.6 |
| Qwen2.5-VL-7B-Instruct | 8.3B | - | 56.4 | 65.1 | 71.6 |
| InternVL2.5-8B | 8.1B | 67.7 | 54.8 | 64.2 | 66.9 |
| MiniCPM-V-2.6 | 8.1B | 69.1 | 53.0 | 60.9 | 63.6 |
| MiniCPM-o-2.6 | 8.7B | 71.9 | 56.7 | 63.9 | 69.6 |
| MiniCPM-V-4.0 | 4.1B | 71.4 | 54.0 | 61.2 | 65.8 |
### Examples
MiniCPM-V 4.0 runs locally on an iPhone 16 Pro Max with the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md).
## Usage
```python
from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'openbmb/MiniCPM-V-4'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  # sdpa or flash_attention_2, no eager
                                  attn_implementation='sdpa',
                                  torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

image = Image.open('./assets/single.png').convert('RGB')

# First round chat
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    image=image,
    tokenizer=tokenizer
)
print(answer)

# Second round chat: pass the history of the multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    image=None,
    tokenizer=tokenizer
)
print(answer)
```
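The same `chat` interface also accepts several images in a single user turn, following the multi-image pattern documented for MiniCPM-V 2.6. A minimal sketch, reusing `model` and `tokenizer` from above (the image paths here are placeholders):

```python
# Multi-image chat: put several PIL images in one user turn.
# 'image1.jpg' and 'image2.jpg' are placeholder paths.
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare the two images and describe the differences between them.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    msgs=msgs,
    image=None,
    tokenizer=tokenizer
)
print(answer)
```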
## License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM-o/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-V 4.0 weights are also available for free commercial use.
#### Statement
* As an LMM, MiniCPM-V 4.0 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgements. Anything generated by MiniCPM-V 4.0 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse or dissemination of the model.
## Key Techniques and Other Multimodal Projects
👏 Welcome to explore the key techniques of MiniCPM-V 4.0 and other multimodal projects of our team:
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)
## Citation
If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!
```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nature Communications},
  volume={16},
  pages={5509},
  year={2025}
}
```