Qwen2.5-VL-7B-Instruct-GPTQ-Int4

This is an UNOFFICIAL GPTQ-Int4 quantized version of the Qwen2.5-VL model, produced with the gptqmodel library.

The model is compatible with the latest transformers library (which can run non-quantized Qwen2.5-VL models).

Performance

Model	Size (Disk)	ChartQA (test)	OCRBench
Qwen2.5-VL-3B-Instruct	7.1 GB	83.48	791
Qwen2.5-VL-3B-Instruct-AWQ	3.2 GB	82.52	786
Qwen2.5-VL-3B-Instruct-GPTQ-Int4	3.2 GB	82.56	784
Qwen2.5-VL-3B-Instruct-GPTQ-Int3	2.9 GB	76.68	742
Qwen2.5-VL-7B-Instruct	16.0 GB	83.2	846
Qwen2.5-VL-7B-Instruct-AWQ	6.5 GB	79.68	837
Qwen2.5-VL-7B-Instruct-GPTQ-Int4	6.5 GB	81.48	845
Qwen2.5-VL-7B-Instruct-GPTQ-Int3	5.8 GB	78.56	823
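
To make the size and accuracy trade-off in the table easier to read, the following is a minimal sketch that computes, for each quantized 7B variant, its disk size relative to the non-quantized model and the change in each benchmark score. The numbers are copied from the table above; the `RESULTS` dictionary and the `compare` helper are illustrative names, not part of any library.

```python
# Figures copied from the table above:
# (size on disk in GB, ChartQA test score, OCRBench score).
RESULTS = {
    "Qwen2.5-VL-7B-Instruct": (16.0, 83.2, 846),
    "Qwen2.5-VL-7B-Instruct-AWQ": (6.5, 79.68, 837),
    "Qwen2.5-VL-7B-Instruct-GPTQ-Int4": (6.5, 81.48, 845),
    "Qwen2.5-VL-7B-Instruct-GPTQ-Int3": (5.8, 78.56, 823),
}


def compare(baseline, candidate, results=RESULTS):
    """Return relative disk size and score deltas of candidate vs. baseline."""
    base_size, base_chart, base_ocr = results[baseline]
    size, chart, ocr = results[candidate]
    return {
        "size_ratio": round(size / base_size, 3),
        "chartqa_delta": round(chart - base_chart, 2),
        "ocrbench_delta": ocr - base_ocr,
    }


if __name__ == "__main__":
    baseline = "Qwen2.5-VL-7B-Instruct"
    for name in RESULTS:
        if name != baseline:
            print(name, compare(baseline, name))
```

By this calculation the GPTQ-Int4 model takes roughly 41% of the disk space of the non-quantized 7B model while losing 1.72 points on ChartQA and 1 point on OCRBench.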

Note

Evaluations are performed using lmms-eval with default settings.
GPTQ models are computationally more efficient (lower VRAM usage, faster inference) than the AWQ series in these evaluations.
We recommend using gptqmodel instead of the autogptq library, as autogptq is no longer maintained.

Quick Tour

Install the required libraries:

pip install git+https://github.com/huggingface/transformers accelerate qwen-vl-utils
pip install git+https://github.com/huggingface/optimum.git
pip install gptqmodel

Depending on your environment, you may also need to install:

pip install tokenicer device_smi logbar
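
Before running the sample, it can help to confirm which of these packages are actually present. The sketch below uses only the standard library; it assumes the distribution names match the names used in the pip commands above, and the `installed_versions` helper is a hypothetical name introduced here for illustration. On older Python versions the lookup may be sensitive to `-` versus `_` in a package name.

```python
from importlib import metadata

# Names taken from the pip commands above.
REQUIRED = ["transformers", "accelerate", "qwen-vl-utils", "optimum", "gptqmodel"]
OPTIONAL = ["tokenicer", "device_smi", "logbar"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED + OPTIONAL).items():
        print(f"{name}: {version or 'not installed'}")
```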

Sample code:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the quantized model and its processor from the Hugging Face Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "hfl/Qwen2.5-VL-7B-Instruct-GPTQ-Int4",
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("hfl/Qwen2.5-VL-7B-Instruct-GPTQ-Int4")

# The prompt below is Chinese for "Please describe this image."
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://raw.githubusercontent.com/ymcui/Chinese-LLaMA-Alpaca-3/refs/heads/main/pics/banner.png"},
        {"type": "text", "text": "请你描述一下这张图片。"},
    ],
}]

# Render the chat template and fetch the image referenced in the message.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)  # follow the device chosen by device_map="auto"

# Generate, then drop the prompt tokens so that only new tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text[0])

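Response (translated from the model's Chinese output):

This image shows a logo in Chinese and English that reads “中文LLaMA & Alpaca大模型” and “Chinese LLaMA & Alpaca Large Language Models”. On the left of the logo are two cartoon characters: one is an alpaca wearing a red scarf and the other is an alpaca with white fur, set against a green lawn and a building with a red roof. On the right of the logo is the number 3, with some circuit patterns next to it. The overall design is clean and clear, using bright colors and cute cartoon characters to attract attention.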
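
The sample code slices each generated sequence with `out_ids[len(in_ids):]` because `generate` returns the prompt tokens followed by the newly generated ones. The toy example below illustrates that trimming step with plain Python lists, so it runs without the model; the token IDs are made up and `trim_prompt` is a hypothetical helper name.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Made-up token IDs: each generated row starts with a copy of its prompt row.
prompt_batch = [[101, 7, 8], [101, 9, 10]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

new_tokens = trim_prompt(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43], [44, 45]]
```

Without this step, the decoded text would repeat the full chat prompt before the model's answer.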

Disclaimer

This is NOT an official model by Qwen. Use at your own risk.
For detailed usage, please check Qwen2.5-VL's page.
