	---
	license: other
	pipeline_tag: visual-question-answering
	---

	<p align="center">
	<img src="logo_en.png" width="600"/>
	</p>

	<p align="center">
	<b><font size="6">InternLM-XComposer-2.5-OL</font></b>
	</p>

	<div align="center">

	[💻Github Repo](https://github.com/InternLM/InternLM-XComposer)

	</div>


	InternLM-XComposer2.5-OL is a comprehensive multimodal system for long-term streaming video and audio interactions.

	### Import from Transformers
	To load the base LLM with Transformers, use the following code:
	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	torch.set_grad_enabled(False)

	# init model and tokenizer
	model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
	tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
	model.tokenizer = tokenizer
	```

	To load the base audio model using MS-Swift, use the following code:
	```python
	import os
	os.environ['USE_HF'] = 'True'

	import torch
	from swift.llm import (
	    get_model_tokenizer, get_template, ModelType,
	    get_default_template_type, inference
	)
	from swift.utils import seed_everything

	model_type = ModelType.qwen2_audio_7b_instruct
	model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
	template_type = get_default_template_type(model_type)
	print(f'template_type: {template_type}')

	model, tokenizer = get_model_tokenizer(
	    model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
	    model_kwargs={'device_map': 'cuda:0'})
	model.generation_config.max_new_tokens = 256
	template = get_template(template_type, tokenizer)
	seed_everything(42)
	```
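
Both snippets above pass a `model_dir` argument (`'base'` or `'audio'`), which suggests the repository keeps each component in its own top-level subfolder. If that holds, one way to fetch only the component you need is `snapshot_download` from `huggingface_hub` with `allow_patterns`. The following is a minimal sketch under that assumption: the helper names `component_patterns` and `download_component` are hypothetical, and the folder layout should be checked against the repository's file listing.

```python
from typing import List

REPO_ID = 'internlm/internlm-xcomposer2d5-ol-7b'


def component_patterns(component: str) -> List[str]:
    """Return glob patterns matching the files under one top-level subfolder."""
    component = component.strip('/')
    if not component:
        raise ValueError('component must be a non-empty folder name')
    return [f'{component}/*']


def download_component(component: str, repo_id: str = REPO_ID) -> str:
    """Download one subfolder of the repo and return the local snapshot path.

    Assumption: each component (e.g. 'base', 'audio') lives in a top-level
    folder of the same name inside the repository.
    """
    # Imported lazily so component_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=component_patterns(component),
    )
```

Under these assumptions, `download_component('audio')` would return a local directory containing only the `audio/` files, which can be useful when disk space or bandwidth is limited.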


	## Quickstart

	We provide simple examples below to show how to use InternLM-XComposer-2.5-OL with 🤗 Transformers and MS-Swift. For a complete guide, please refer to the [examples README](https://github.com/InternLM/InternLM-XComposer/blob/main/InternLM-XComposer-2.5-OmniLive/examples/README.md).


	<details>
	<summary>
	<b>Audio Understanding</b>
	</summary>

	```python
	import os
	os.environ['USE_HF'] = 'True'

	import torch
	from swift.llm import (
	    get_model_tokenizer, get_template, ModelType,
	    get_default_template_type, inference
	)
	from swift.utils import seed_everything

	model_type = ModelType.qwen2_audio_7b_instruct
	model_id_or_path = 'internlm/internlm-xcomposer2d5-ol-7b'
	template_type = get_default_template_type(model_type)
	print(f'template_type: {template_type}')

	model, tokenizer = get_model_tokenizer(
	    model_type, torch.float16, model_id_or_path=model_id_or_path, model_dir='audio',
	    model_kwargs={'device_map': 'cuda:0'})
	model.generation_config.max_new_tokens = 256
	template = get_template(template_type, tokenizer)
	seed_everything(42)

	# Chinese ASR
	query = '<audio>Detect the language and recognize the speech.'
	response, _ = inference(model, template, query, audios='examples/audios/chinese.mp3')
	print(f'query: {query}')
	print(f'response: {response}')
	```
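
To run the same prompt over several recordings, the call above can be wrapped in a loop. This is a minimal sketch: `transcribe_all` and `run_inference` are hypothetical names, and it assumes `inference` keeps returning a `(response, history)` pair as in the example above.

```python
from typing import Callable, Dict, Iterable

ASR_QUERY = '<audio>Detect the language and recognize the speech.'


def transcribe_all(
    audio_paths: Iterable[str],
    run_inference: Callable[[str, str], str],
    query: str = ASR_QUERY,
) -> Dict[str, str]:
    """Apply one query to each audio file and collect responses keyed by path."""
    results: Dict[str, str] = {}
    for path in audio_paths:
        # run_inference hides the model and template, so this loop has no
        # dependency on swift and can be exercised with a stub.
        results[path] = run_inference(query, path)
    return results
```

With the `model` and `template` from the example above, a call might look like `transcribe_all(['examples/audios/chinese.mp3'], lambda q, a: inference(model, template, q, audios=a)[0])`.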

	</details>


	<details>
	<summary>
	<b>Image Understanding</b>
	</summary>

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer

	torch.set_grad_enabled(False)

	# init model and tokenizer
	model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half()
	tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-ol-7b', model_dir='base', trust_remote_code=True)
	model.tokenizer = tokenizer

	query = 'Analyze the given image in a detail manner'
	image = ['examples/images/dubai.png']
	with torch.autocast(device_type='cuda', dtype=torch.float16):
	    response, _ = model.chat(tokenizer, query, image, do_sample=False, num_beams=3, use_meta=True)
	print(response)
	```
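
The image path above is relative, so the call depends on the working directory containing an `examples/` folder (presumably a checkout of the GitHub repository). A small pre-flight check can surface a missing file before the model is invoked. The helper name `existing_images` is hypothetical and is not part of the model's API.

```python
from pathlib import Path
from typing import Iterable, List


def existing_images(paths: Iterable[str]) -> List[str]:
    """Return the paths unchanged if all exist; otherwise raise, listing the missing ones."""
    checked = list(paths)
    missing = [p for p in checked if not Path(p).is_file()]
    if missing:
        raise FileNotFoundError(f'image file(s) not found: {missing}')
    return checked
```

For instance, `image = existing_images(['examples/images/dubai.png'])` could replace the plain list assignment before `model.chat(...)` is called.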

	</details>


	### Citation

	If you find InternLM-XComposer2.5-OL useful for your research and applications, please cite using this BibTeX:
	```bibtex
	@misc{zhang2024internlmxcomposer25omnilivecomprehensivemultimodallongterm,
	title={InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions},
	author={Pan Zhang and Xiaoyi Dong and Yuhang Cao and Yuhang Zang and Rui Qian and Xilin Wei and Lin Chen and Yifei Li and Junbo Niu and Shuangrui Ding and Qipeng Guo and Haodong Duan and Xin Chen and Han Lv and Zheng Nie and Min Zhang and Bin Wang and Wenwei Zhang and Xinyue Zhang and Jiaye Ge and Wei Li and Jingwen Li and Zhongying Tu and Conghui He and Xingcheng Zhang and Kai Chen and Yu Qiao and Dahua Lin and Jiaqi Wang},
	year={2024},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2412.09596},
	}
	```


	### Open Source License
	The code is licensed under Apache-2.0. The model weights are fully open for academic research and also allow free commercial use. To apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact [email protected].