Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound

4-bit precision


	---
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	license: apache-2.0
	---

	## Model Details
	This is an example model demonstrating how to run an AutoRound-format vision-language model on vLLM. It uses mixed precision: some modules of the visual encoder are quantized to 8-bit precision rather than 4-bit.


	## Run the Model


	Running this model on vLLM requires the changes in this PR: https://github.com/vllm-project/vllm/pull/21802

	~~~bash
	vllm serve Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound --dtype bfloat16 --port 8001 --max-model-len 10000
	~~~
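Loading the weights can take a while, so it may help to wait until the server answers before sending requests. The following is a minimal sketch, assuming the server was started with the command above on port 8001 and exposes vLLM's OpenAI-compatible `/v1/models` route. The helper name `wait_for_server` and its default timeouts are illustrative choices, not part of vLLM.

```python
import time
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/v1/models until it returns HTTP 200 or timeout_s elapses."""
    # An empty ProxyHandler disables proxies, mirroring curl's --noproxy '*'.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener.open(f"{base_url}/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeout, or an HTTP error: server not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

For example, `wait_for_server("http://localhost:8001")` returns `True` once the server responds and `False` if it does not come up within five minutes.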

	~~~bash
	curl --noproxy '*' http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{
	  "model": "Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
	  "messages": [
	    {
	      "role": "user",
	      "content": [
	        {
	          "type": "image_url",
	          "image_url": {
	            "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
	          }
	        },
	        {
	          "type": "text",
	          "text": "Please describe this image."
	        }
	      ]
	    }
	  ],
	  "max_tokens": 512
	}'
	~~~
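The same request can be sent from Python. This is a sketch using only the standard library, assuming the server from the previous step is listening on `localhost:8001`. The helper names `build_payload` and `chat` are hypothetical, and the response is read as `choices[0].message.content`, the shape used by OpenAI-compatible chat completion endpoints.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"
MODEL = "Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound"
IMAGE_URL = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"


def build_payload(prompt: str, image_url: str = IMAGE_URL, max_tokens: int = 512) -> dict:
    """Build the chat-completions body: one user turn with an image and a text part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # An empty ProxyHandler disables proxies, mirroring curl's --noproxy '*'.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Please describe this image."))` prints the model's description of the demo image.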



	## Generate the Model

	~~~python
	import torch
	from auto_round import AutoRoundMLLM
	from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

	model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

	# Default: load the model on the available device(s).
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    model_name, torch_dtype="auto", device_map="auto"
	)

	tokenizer = AutoTokenizer.from_pretrained(model_name)

	processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

	# Per-layer overrides for the visual encoder: leave its MLP projections at
	# 16 bits and quantize its remaining linear layers to 8 bits.
	layer_config = {}
	for n, m in model.named_modules():
	    if "visual" in n:
	        if not isinstance(m, torch.nn.Linear):
	            continue
	        if "mlp.gate_proj" in n or "mlp.down_proj" in n or "mlp.up_proj" in n:
	            layer_config[n] = {"bits": 16}
	        else:
	            layer_config[n] = {"bits": 8}

	autoround = AutoRoundMLLM(model, tokenizer, processor=processor, iters=200, group_size=128, layer_config=layer_config)
	autoround.quantize_and_save("./Qwen2.5-VL-7B-Instruct-autoround")
	~~~
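The per-layer rule in the script can be checked in isolation, without loading the model. The sketch below restates it as a pure function over module names. The function name `visual_layer_bits` and the example module names are hypothetical; the real names come from `model.named_modules()`. Layers for which it returns `None` get no entry in `layer_config` and fall back to AutoRound's global settings, presumably the 4-bit scheme the model name refers to.

```python
from typing import Optional

# Substrings that mark the MLP projections the script leaves at 16 bits.
MLP_PROJECTIONS = ("mlp.gate_proj", "mlp.down_proj", "mlp.up_proj")


def visual_layer_bits(name: str, is_linear: bool) -> Optional[int]:
    """Return the bit-width override for a module, or None if it gets no override."""
    if "visual" not in name or not is_linear:
        return None  # outside the visual encoder, or not a linear layer
    if any(proj in name for proj in MLP_PROJECTIONS):
        return 16  # visual MLP projections stay unquantized
    return 8  # other visual linear layers are quantized to 8 bits


# Hypothetical module names, for illustration only.
examples = [
    ("visual.blocks.0.mlp.gate_proj", True),
    ("visual.blocks.0.attn.qkv", True),
    ("visual.blocks.0.norm1", False),
    ("model.layers.0.self_attn.q_proj", True),
]
for module_name, linear in examples:
    print(module_name, "->", visual_layer_bits(module_name, linear))
```

Keeping the rule in one small function makes it easy to try a different split, for example 8 bits for the MLP projections as well, before starting a full quantization run.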