---
license: mit
datasets:
- laion/laion2B-en
- laion/laion-coco
- laion/laion2B-multi
- kakaobrain/coyo-700m
- conceptual_captions
- wanng/wukong100m
pipeline_tag: image-feature-extraction
---
# InternVL-14B-224px

[\[GitHub\]](https://github.com/OpenGVLab/InternVL) [\[InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[Mini-InternVL\]](https://arxiv.org/abs/2410.16261) [\[InternVL 2.5\]](https://huggingface.co/papers/2412.05271)

[\[Blog\]](https://internvl.github.io/blog/) [\[Chat Demo\]](https://internvl.opengvlab.com/) [\[HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[Quick Start\]](#quick-start) [\[Documents\]](https://internvl.readthedocs.io/en/latest/)

<div align="center">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
</div>

## Model Details

- **Model Type:** vision-language foundation model
- **Supported Tasks:** zero-shot image/video classification, image-text/video retrieval, image captioning
- **Model Stats:**
  - Params: 14B
  - Image size: 224 x 224 (see the preprocessing check below)
- **Pretraining Datasets:** LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, LAION-multi
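As a quick check of the expected input resolution, the minimal sketch below runs the image processor shipped with this checkpoint on a dummy image; the dummy image and the printed shape are only illustrative, and the exact resize/crop behavior comes from the repo's `preprocessor_config.json`.

```python
from PIL import Image
from transformers import CLIPImageProcessor

# Load the preprocessing config shipped with the checkpoint.
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

# Any RGB image works here; the processor resizes and normalizes it for the model.
dummy = Image.new('RGB', (640, 480))
pixel_values = image_processor(images=dummy, return_tensors='pt').pixel_values
print(pixel_values.shape)  # expected: torch.Size([1, 3, 224, 224])
```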
## Zero-Shot Performance

See this [document](https://github.com/OpenGVLab/InternVL/tree/main/clip_benchmark#-evaluation-zero-shot-image-classification) for more details about the zero-shot evaluation.
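For orientation, the snippet below sketches how a zero-shot classification call can be assembled from the forward API shown in the [Quick Start](#quick-start) section; the class names and the single `'a photo of a ...'` template are illustrative and are not the exact prompt set used by the evaluation suite.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = 'OpenGVLab/InternVL-14B-224px'
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True, trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # required, see the warning in Quick Start

# Illustrative label set; build one 'summarize:'-prefixed prompt per class.
class_names = ['red panda', 'giant panda', 'cat']
texts = ['summarize:a photo of a ' + name for name in class_names]

image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# Contrastive scores between the image and every class prompt (InternVL-C head).
with torch.no_grad():
    logits_per_image, _ = model(image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)  # shape [1, num_classes]
print(class_names[probs.argmax(dim=-1).item()], probs)
```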
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/KfsrXioPU77T48sRb60oL.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/q5UkfrEix6w3mnn_1w4ja.png)
## Quick Start

> \[!Warning\]
> 🚨 Note: the prefix `'summarize:'` and `tokenizer.pad_token_id = 0` are necessary. Their absence will lead to abnormal results.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor
from transformers import AutoTokenizer

model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px', use_fast=False, add_eos_token=True)
tokenizer.pad_token_id = 0  # set pad_token_id to 0

images = [
    Image.open('./examples/image1.jpg').convert('RGB'),
    Image.open('./examples/image2.jpg').convert('RGB'),
    Image.open('./examples/image3.jpg').convert('RGB')
]
prefix = 'summarize:'
texts = [
    prefix + 'a photo of a red panda',  # English
    prefix + '一张熊猫的照片',  # Chinese: "a photo of a panda"
    prefix + '二匹の猫の写真'  # Japanese: "a photo of two cats"
]

pixel_values = image_processor(images=images, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()
input_ids = tokenizer(texts, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# InternVL-C
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-C')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],
#         [2.2949e-02, 9.7656e-01, 5.9903e-06],
#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# InternVL-G
logits_per_image, logits_per_text = model(
    image=pixel_values, text=input_ids, mode='InternVL-G')
probs = logits_per_image.softmax(dim=-1)
# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],
#         [8.6060e-03, 9.9219e-01, 2.8759e-06],
#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<SoftmaxBackward0>)

# please set add_eos_token to False for generation
tokenizer.add_eos_token = False
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

tokenized = tokenizer("English caption:", return_tensors='pt')
pred = model.generate(
    pixel_values=pixel_values,
    input_ids=tokenized.input_ids.cuda(),
    attention_mask=tokenized.attention_mask.cuda(),
    num_beams=5,
    min_new_tokens=8,
)
caption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()
# English caption: a red panda sitting on top of a wooden platform
```
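The contrastive logits above also cover the retrieval use case listed under supported tasks. The fragment below is a minimal sketch that continues from the variables defined in the snippet above (it reuses `logits_per_text` from the last forward pass); the top-1 ranking is illustrative.

```python
# Text-to-image retrieval: rank the three example images for each text query.
# logits_per_text has shape [num_texts, num_images]; higher means more similar.
text_to_image = logits_per_text.softmax(dim=-1)
scores, indices = text_to_image.topk(k=1, dim=-1)
for t, (score, idx) in enumerate(zip(scores, indices)):
    print(f'text {t} -> image {idx.item()} (p={score.item():.3f})')
```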
## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
}
@article{gao2024mini,
  title={Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5\% Parameters and 90\% Performance},
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
  journal={arXiv preprint arXiv:2410.16261},
  year={2024}
}
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
}
@inproceedings{chen2024internvl,
  title={InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
}
```