---
license: other
pipeline_tag: visual-question-answering
---

<p align="center">
    <img src="logo_en.png" width="600"/>
</p>

<p align="center">
    <b><font size="6">InternLM-XComposer-2.5-OL</font></b>
</p>

<div align="center">

[💻Github Repo](https://github.com/InternLM/InternLM-XComposer)

</div>

**InternLM-XComposer2.5-OL** is a comprehensive multimodal system for long-term streaming video and audio interactions.

### Import from Transformers

To load the base LLM using Transformers, use the following code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Inference only: disable gradient tracking globally
torch.set_grad_enabled(False)

# init model and tokenizer
# Weights are loaded as bfloat16 and then cast to float16 by .half()
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer
```

To load the base audio model using MS-Swift, use the following code:

```python
import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
                                       model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
```

## Quickstart

The examples below show how to use InternLM-XComposer-2.5-OL with 🤗 Transformers and MS-Swift. For the complete guide, see the [examples README](https://github.com/InternLM/InternLM-XComposer/blob/main/InternLM-XComposer-2.5-OmniLive/examples/README.md).

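The snippets in this section use relative paths (`examples/audios/chinese.mp3` and `examples/images/dubai.png`), so they presumably need to run from a directory that contains the `examples/` folder. As a minimal sketch, a preflight check could report missing files before any model weights are loaded. The helper name `missing_assets` and the `root` parameter are illustrative assumptions, not part of the project.

```python
from pathlib import Path

# Paths copied from the Quickstart snippets; adjust if your checkout differs.
EXAMPLE_ASSETS = [
    "examples/audios/chinese.mp3",
    "examples/images/dubai.png",
]


def missing_assets(paths, root="."):
    """Return the entries of `paths` that are not existing files under `root`.

    Hypothetical helper for illustration only.
    """
    base = Path(root)
    return [p for p in paths if not (base / p).is_file()]


if __name__ == "__main__":
    missing = missing_assets(EXAMPLE_ASSETS)
    if missing:
        print("Missing example files:", ", ".join(missing))
    else:
        print("All example files found.")
```

Running this first avoids waiting for a multi-gigabyte model load only to fail on a missing input file.
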

<details>
<summary>
<b>Audio Understanding</b>
</summary>

```python
import os
os.environ['USE_HF'] = 'True'

import torch
from swift.llm import (
    get_model_tokenizer, get_template, ModelType,
    get_default_template_type, inference
)
from swift.utils import seed_everything

model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
                                       model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)

# Chinese ASR
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')
```

</details>
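
To run the same query over several audio files, the single `inference` call in the audio example can be wrapped in a loop. The sketch below is one possible shape, not the project's own API: `transcribe_all` and the `infer_fn` adapter are hypothetical names, and the model call is injected as a function so the loop can be exercised without loading any weights.

```python
from typing import Callable, Dict, Iterable

# Query string copied from the audio example.
ASR_QUERY = "<audio>Detect the language and recognize the speech."


def transcribe_all(
    infer_fn: Callable[[str, str], str],
    audio_paths: Iterable[str],
    query: str = ASR_QUERY,
) -> Dict[str, str]:
    """Run `query` against each audio file and collect the responses by path.

    `infer_fn(query, path)` is a hypothetical adapter that returns the
    response text for a single file.
    """
    results = {}
    for path in audio_paths:
        results[path] = infer_fn(query, path)
    return results


# With `model` and `template` loaded as in the audio example, the adapter
# could be written as (assumption, mirrors the call shown there):
#   infer_fn = lambda q, p: inference(model, template, q, audios=p)[0]
```

Keeping the model call behind `infer_fn` means the model is loaded once and reused for every file.
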

<details>
<summary>
<b>Image Understanding</b>
</summary>

```python
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer

query = 'Analyze the given image in a detailed manner'
image = ['examples/images/dubai.png']
# Run generation under float16 autocast on the GPU
with torch.autocast(device_type='cuda', dtype=torch.float16):
    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
```

</details>

### Citation

If you find InternLM-XComposer2.5-OL useful for your research and applications, please cite it using this BibTeX:

```bibtex
@misc{zhang2024internlmxcomposer25omnilivecomprehensivemultimodallongterm,
      title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions},
      author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2024},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.09596},
}
```

### Open Source License

The code is licensed under Apache-2.0. The model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact [email protected].