RedHatAI/Qwen2.5-VL-72B-Instruct-quantized.w8a8

Apr 2

硬件环境：H800 * 2
vllm版本：0.8.2
启动脚本：

#! /bin/bash
export VLLM_USE_TRITON_FLASH_ATTN=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_FLASHINFER_FORCE_TENSOR_CORES=1
export VLLM_USE_V1=1
export VLLM_ENABLE_V1_MULTIPROCESSING=1
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export TORCH_CUDA_ARCH_LIST=9.0
export VLLM_MM_INPUT_CACHE_GIB=6
vllm serve /openbayes/input/input2 --host 0.0.0.0 --port 80 --trust-remote-code --max-model-len 128000 --max-num-batched-tokens 128000 --max-seq-len-to-capture 128000 --gpu-memory-utilization 0.95 --max-num-seqs 64 --served-model-name Qwen2-VL-72B Qwen2.5-VL-72B --limit-mm-per-prompt image=50,video=2 -tp 2 --disable-mm-preprocessor-cache

出现以下错误

ERROR 04-02 08:30:29 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-02 08:30:29 [core.py:340]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 332, in run_engine_core
ERROR 04-02 08:30:29 [core.py:340]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-02 08:30:29 [core.py:340]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 287, in __init__
ERROR 04-02 08:30:29 [core.py:340]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-02 08:30:29 [core.py:340]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 62, in __init__
ERROR 04-02 08:30:29 [core.py:340]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 04-02 08:30:29 [core.py:340]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 121, in _initialize_kv_caches
ERROR 04-02 08:30:29 [core.py:340]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-02 08:30:29 [core.py:340]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-02 08:30:29 [core.py:340]     output = self.collective_rpc("determine_available_memory")
ERROR 04-02 08:30:29 [core.py:340]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 04-02 08:30:29 [core.py:340]     raise e
ERROR 04-02 08:30:29 [core.py:340]   File "/usr/local/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 04-02 08:30:29 [core.py:340]     raise result
ERROR 04-02 08:30:29 [core.py:340] RuntimeError: Expected there to be 50 prompt updates corresponding to 50 image items, but instead found 0 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs, or there is a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_prompt_updates`).

mgoin

Red Hat AI org May 6

This was resolved with updates to the model config files https://huggingface.co/RedHatAI/Qwen2.5-VL-72B-Instruct-quantized.w8a8/commit/87784eabe0555e4125427d5e417fd711d7c0952f, please try downloading the model again

mgoin changed discussion status to closed May 6

RedHatAI
/

Qwen2.5-VL-72B-Instruct-quantized.w8a8

用vllm serve启动不了