infly
/

InfMLLM2_7B_chat

Model card Files Files and versions

QianYEee commited on Oct 30, 2024

Commit

6582ce4

·

verified ·

1 Parent(s): 8a096e8

Update README.md

Files changed (1) hide show

README.md +77 -3

README.md CHANGED Viewed

@@ -1,3 +1,77 @@
----
-license: mit
----

+---
+license: mit
+---
+## INF-MLLM2: High-Resolution Image and Document Understanding
+In INF-MLLM2, we have introduced significant updates, particularly in high-resolution image processing, document understanding and OCR.
+The key improvements include the following:
+- Dynamic Image Resolution Support: The model now supports dynamic image resolution up to 1344x1344 pixels.
+- Enhanced OCR Capabilities: The model has significantly improved OCR capabilities, enabling robust document parsing, table and formula recognition, document layout analysis, and key information extraction.
+- Advanced Training Strategies: We employed a progressive multi-stage training strategy along with an enhanced data mixup strategy tailored for image and document multitask scenarios.
+<p align="center">
+<img src="docs/model.png" alt="" width="100%"/>
+</p>
+[Technical Report](docs/tech_report.pdf)
+### Install
+```bash
+conda create -n infmllm2 python=3.9
+conda activate infmllm2
+conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.1.2
+pip install transformers==4.40.2 timm==0.5.4 pillow==10.4.0 sentencepiece==0.1.99
+pip install bigmodelvis peft einops spacy
+```
+### Model Zoo
+We have released the INF-MLLM2-7B model on Hugging Face.
+- [INF-MLLM2-7B](https://huggingface.co/QianYEee/InfMLLM2_7B_chat)
+### Evaluation
+The comparison with general multimodal LLM across multiple benchmarks and OCR-related tasks.
+<p align="center">
+<img src="docs/results_1.jpg" alt="" width="90%"/>
+</p>
+The comparison with OCR-free multimodal LLM for content parsing of documents/tables/formulas.
+<p align="center">
+<img src="docs/results_2.jpg" alt="" width="90%"/>
+</p>
+The comparison with OCR-free multimodal LLM for key information extraction.
+<p align="center">
+<img src="docs/results_3.jpg" alt="" width="90%"/>
+</p>
+### Visualization
+<p align="center">
+<img src="docs/demo1.png" alt="" width="90%"/>
+</p>
+<p align="center">
+<img src="docs/demo2.png" alt="" width="90%"/>
+</p>
+<p align="center">
+<img src="docs/demo3.png" alt="" width="90%"/>
+</p>
+<p align="center">
+<img src="docs/table_equation.png" alt="" width="90%"/>
+</p>
+### Usage
+The inference process for INF-MLLM2 is straightforward. We also provide a simple [demo.py](demo.py) script as a reference.
+```bash
+CUDA_VISIBLE_DEVICES=0 python demo.py --model_path /path/to/InfMLLM2_7B_chat
+```
+## Acknowledgement
+We thank the great work from [LLaVA-Next](https://github.com/LLaVA-VL/LLaVA-NeXT.git) and [InternLM-XComposer](https://github.com/InternLM/InternLM-XComposer.git).