The new vLLM does not appear to be working.
vLLM automatically converted the AWQ checkpoint to awq_marlin!
You can see it in the log:
INFO: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
In awq_marlin.py, at lines 180-181:
if is_layer_skipped(
    prefix, self.modules_to_not_convert, self.packed_modules_mapping
):
But note: it does not pass skip_with_substr=True! This means:
- vLLM automatically converts AWQ to awq_marlin for better performance
- the is_layer_skipped call in awq_marlin.py is missing the skip_with_substr=True argument
- by default, is_layer_skipped matches the layer prefix exactly rather than by substring
- under substring matching, modules_to_not_convert: ["model.layers.0."] would match model.layers.0.mlp.down_proj
- under exact prefix matching it does not (see the sketch after this list)
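To make the difference concrete, here is a minimal sketch of the two matching behaviors. This is a simplified illustration, not vLLM's actual is_layer_skipped implementation, and it assumes the default check is an exact-equality comparison of the layer prefix:

# Simplified illustration only -- not vLLM's real is_layer_skipped.
def is_skipped_sketch(prefix, modules_to_not_convert, skip_with_substr=False):
    if skip_with_substr:
        # Substring matching: skip if any entry appears anywhere in the prefix.
        return any(m in prefix for m in modules_to_not_convert)
    # Default (assumed here): skip only on an exact match of the prefix.
    return prefix in modules_to_not_convert

modules = ["model.layers.0."]
layer = "model.layers.0.mlp.down_proj"

print(is_skipped_sketch(layer, modules))                         # False: layer stays quantized
print(is_skipped_sketch(layer, modules, skip_with_substr=True))  # True: layer is skipped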
The file that needs to be modified: /site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
178     isinstance(layer, ParallelLMHead) and self.lm_head_quantized
179 ):
180     if is_layer_skipped(
181 -       prefix, self.modules_to_not_convert, self.packed_modules_mapping
181 +       prefix, self.modules_to_not_convert, self.packed_modules_mapping,
182 +       skip_with_substr=True
183     ):
184         return UnquantizedLinearMethod()
185     # Check if the layer is supported by AWQMarlin.
Also fix the call around line 197 (the FusedMoE branch):
site-packages/vllm/model_executor/layers/quantization/awq_marlin.py
195 elif isinstance(layer, FusedMoE):
196     from vllm.model_executor.layers.quantization.moe_wna16 import MoeWNA16Config
197
198 -   if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", [])):
198 +   if is_layer_skipped(prefix, getattr(self, "modules_to_not_convert", []),
199 +                       skip_with_substr=True):
200         return UnquantizedFusedMoEMethod(layer.moe_config)
201     if not check_moe_marlin_supports_layer(layer, self.group_size):
202         logger.warning_once(
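The exact site-packages path varies between environments; one way to locate the installed copy of the file to patch is to import the module and print its path (assumes vllm is importable in the target environment):

# Print the location of the installed awq_marlin.py so the edits above can be applied there.
import vllm.model_executor.layers.quantization.awq_marlin as awq_marlin

print(awq_marlin.__file__)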
We would need the launch command on your side so we can take a look (in case you cannot bring this model up with vLLM directly):
MODEL_PATH="/mnt/cache/models/deepseek-ai/DeepSeek-V3.2-Exp-AWQ"
MODEL_NAME="MY_MODEL"
HOST="0.0.0.0"
PORT="8000"
# Server settings
MAX_MODEL_LEN=32768
MAX_NUM_SEQS=32
GPU_MEMORY_UTIL=0.95
SWAP_SPACE=16
# Set environment
export VLLM_USE_MODELSCOPE=true
echo "=========================================="
echo "Starting vLLM DeepSeek V3.2 Server"
echo "=========================================="
echo "Model: $MODEL_PATH"
echo "Host: $HOST:$PORT"
echo "Max Model Length: $MAX_MODEL_LEN"
echo "GPU Memory Utilization: $GPU_MEMORY_UTIL"
echo "=========================================="
echo ""
# Start server
vllm serve "$MODEL_PATH" \
--served-model-name "$MODEL_NAME" \
--data-parallel-size 8 \
--enable-expert-parallel \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--swap-space "$SWAP_SPACE" \
--max-num-seqs "$MAX_NUM_SEQS" \
--max-model-len "$MAX_MODEL_LEN" \
--gpu-memory-utilization "$GPU_MEMORY_UTIL" \
--trust-remote-code \
--disable-log-requests \
--host "$HOST" \
--port "$PORT"
The installation followed exactly the steps you provided.
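Once a patched build brings the model up, a quick way to confirm the server is actually serving it is to query the OpenAI-compatible /v1/models endpoint (host, port, and model name taken from the script above):

# Check that the vLLM OpenAI-compatible server is up and serving MY_MODEL.
# The script binds 0.0.0.0:8000, so localhost works from the same machine.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    models = json.load(resp)
print([m["id"] for m in models["data"]])  # expect ["MY_MODEL"]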
You're right: the skip_with_substr you pointed to is a parameter newly introduced upstream in vLLM. They updated awq.py correctly but missed the corresponding change in awq_marlin.py.
The commit that introduced the bug: https://github.com/vllm-project/vllm/commit/352c0c8a285414b11373e65fef095af7b07b94d8
We should open an issue for this.
If we force-insert skip_with_substr=True in awq_marlin.py, will the model load correctly? (I don't have an H-series GPU on hand right now, so I can't test it for the moment.)
I looked into it carefully: with that change the model should load correctly, and in our testing it gives reasonable responses.
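For reference, a smoke test along these lines can be used to check for a reasonable reply via the OpenAI-compatible chat endpoint. This is a minimal sketch (served model name and port taken from the launch script above), not the exact test that was run:

# Minimal smoke test against the OpenAI-compatible chat endpoint.
# Assumes the server from the launch script is running locally on port 8000
# and serving the model under the name MY_MODEL.
import json
import urllib.request

payload = {
    "model": "MY_MODEL",
    "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])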