cuierfei commited on
Commit
0e12404
·
verified ·
1 Parent(s): 2780d44

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -5
README.md CHANGED
@@ -19,11 +19,7 @@ pipeline_tag: visual-question-answering
19
 
20
  \[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
21
 
22
- We introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
23
- We introduce three simple designs:
24
- 1. Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model---InternViT-6B, boosting its visual understanding capabilities, and making it can be transferred and reused in different LLMs.
25
- 2. Dynamic High-Resolution: we divide images into tiles ranging from 1 to 40 of 448 × 448 pixels according to the aspect ratio and resolution of the input images, which supports up to 4K resolution input.
26
- 3. High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes, document images, and annotated them with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks.
27
 
28
 
29
  ## Model Details
 
19
 
20
  \[[InternVL 1.5 Technical Report](https://arxiv.org/abs/2404.16821)\] \[[CVPR Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读](https://zhuanlan.zhihu.com/p/675877376)]
21
 
22
+ We are delighted to introduce Mini-InternVL-Chat-2B-V1-5. In the era of large language models, many researchers have started to focus on smaller language models, such as Gemma-2B, Qwen-1.8B, and InternLM2-1.8B. Inspired by their efforts, we have distilled our vision foundation model InternViT-6B-448px-V1-5 down to 300M and used InternLM2-Chat-1.8B as our language model. This resulted in a small multimodal model with excellent performance.
 
 
 
 
23
 
24
 
25
  ## Model Details