OpenGVLab/Mini-InternVL-Chat-2B-V1-5


cuierfei committed on May 25, 2024

Commit 0e12404 · verified · 1 Parent(s): 2780d44

Update README.md

Files changed (1)

README.md +1 -5

@@ -19,11 +19,7 @@ pipeline_tag: visual-question-answering
 \[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\]  \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\]  \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
-We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
-We introduce three simple designs:
-1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs.
-2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 &times; 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input.
-3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
+We are delighted to introduce Mini-InternVL-Chat-2B-V1-5. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model InternViT-6B-448px-V1-5 down to 300M and used InternLM2-Chat-1.8B as our language model. This resulted in a small multimodal model with excellent performance.
 ## Model Details
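
For reference, the dynamic high-resolution step described in the removed README text (1 to 40 tiles of 448 × 448 pixels, chosen according to aspect ratio and resolution) can be sketched roughly as follows. This is a minimal illustration and not the InternVL implementation: the function names `pick_tile_grid` and `tile_boxes`, and the tie-break rule between grids of equal aspect ratio, are assumptions made for this example.

```python
from typing import List, Tuple

TILE = 448  # tile side in pixels, as stated in the README text


def pick_tile_grid(width: int, height: int, max_tiles: int = 40) -> Tuple[int, int]:
    """Return a (cols, rows) grid whose aspect ratio is closest to the image's.

    Hypothetical helper: only the 1..40 tile range and the 448-pixel tile
    size come from the README; the selection rule is an assumption.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    target = width / height
    best = (1, 1)
    best_diff = abs(target - 1.0)
    for cols in range(1, max_tiles + 1):
        # rows is capped so that cols * rows never exceeds max_tiles
        for rows in range(1, max_tiles // cols + 1):
            diff = abs(target - cols / rows)
            more_tiles = cols * rows > best[0] * best[1]
            # Assumed tie-break: take the larger grid only if the image
            # covers more than half of that grid's pixel area.
            fills_grid = width * height > 0.5 * TILE * TILE * cols * rows
            if diff < best_diff or (diff == best_diff and more_tiles and fills_grid):
                best, best_diff = (cols, rows), diff
    return best


def tile_boxes(cols: int, rows: int) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, top, right, bottom) for an image that has been
    resized to (cols * TILE) x (rows * TILE) pixels, in row-major order."""
    return [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]
```

Under these assumptions, an image would first be resized to `cols * 448` by `rows * 448` pixels and then cropped with the boxes from `tile_boxes`, so a wide 1344 × 448 image maps to a 3 × 1 grid of three tiles.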