File size: 5,146 Bytes
0dd2645 61a07f5 0dd2645 61a07f5 0dd2645 61a07f5 0dd2645 61a07f5 12ca35c 0dd2645 e447636 0dd2645 9c4b5bf 0dd2645 5dbb41e 0dd2645 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
---
license: other
pipeline_tag: visual-question-answering
---
<p align="center">
<img src="logo_en.png" width="600"/>
<p>
<p align="center">
<b><font size="6">InternLM-XComposer-2.5-OL</font></b>
<p>
<div align="center">
[💻Github Repo](https://github.com/InternLM/InternLM-XComposer)
</div>
**InternLM-XComposer2.5-OL**, a comprehensive multimodal system for long-term streaming video and audio interactions.
### Import from Transformers
To load the base LLM model using Transformers, use the following code:
```python
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer
```
To load the base audio model using MS-Swift, use the following code:
```python
import os
os.environ['USE_HF'] = 'True'
import torch
from swift.llm import (
get_model_tokenizer, get_template, ModelType,
get_default_template_type, inference
)
from swift.utils import seed_everything
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
```
## Quickstart
We provide simple examples below to show how to use InternLM-XComposer-2.5-OL with 🤗 Transformers. For complete guide, please refer to [here](https://github.com/InternLM/InternLM-XComposer/blob/main/InternLM-XComposer-2.5-OmniLive/examples/README.md).
<details>
<summary>
<b>Audio Understanding</b>
</summary>
```python
import os
os.environ['USE_HF'] = 'True'
import torch
from swift.llm import (
get_model_tokenizer, get_template, ModelType,
get_default_template_type, inference
)
from swift.utils import seed_everything
model_type = ModelType.qwen2_audio_7b_instruct
model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')
model, tokenizer = get_model_tokenizer(model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
model_kwargs={'device_map': 'cuda:0'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
# Chinese ASR
query = '<audio>Detect the language and recognize the speech.'
response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
print(f'query: {query}')
print(f'response: {response}')
```
</details>
<details>
<summary>
<b>Image Understanding</b>
</summary>
```python
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
model.tokenizer = tokenizer
query = 'Analyze the given image in a detail manner'
image = ['examples/images/dubai.png']
with torch.autocast(device_type='cuda', dtype=torch.float16):
response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
print(response)
```
</details>
### Citation
If you find Euclid useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{zhang2024internlmxcomposer25omnilivecomprehensivemultimodallongterm,
title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions},
author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
year={2024},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.09596},
}
```
### Open Source License
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English)/申请表(中文). For other questions or collaborations, please contact [email protected].
|