vLLM Docker deployment

#4
by chriswritescode - opened

rank_0_0 for vLLM's torch.compile
(VllmWorker rank=0 pid=280) INFO 06-27 15:55:57 [backends.py:430] Dynamo bytecode transform time: 11.07 s
(VllmWorker rank=1 pid=281) INFO 06-27 15:56:00 [backends.py:136] Cache the graph of shape None for later use
(VllmWorker rank=2 pid=282) INFO 06-27 15:56:00 [backends.py:136] Cache the graph of shape None for later use
(VllmWorker rank=3 pid=283) INFO 06-27 15:56:00 [backends.py:136] Cache the graph of shape None for later use
(VllmWorker rank=0 pid=280) INFO 06-27 15:56:01 [backends.py:136] Cache the graph of shape None for later use
(VllmWorker rank=3 pid=283) INFO 06-27 15:56:33 [backends.py:148] Compiling a graph for general shape takes 36.04 s
(VllmWorker rank=1 pid=281) INFO 06-27 15:56:34 [backends.py:148] Compiling a graph for general shape takes 36.27 s
(VllmWorker rank=2 pid=282) INFO 06-27 15:56:34 [backends.py:148] Compiling a graph for general shape takes 36.31 s
(VllmWorker rank=0 pid=280) INFO 06-27 15:56:35 [backends.py:148] Compiling a graph for general shape takes 37.21 s
(VllmWorker rank=0 pid=280) WARNING 06-27 15:56:39 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_RTX_6000_Ada_Generation,dtype=fp8_w8a8.json
(VllmWorker rank=2 pid=282) WARNING 06-27 15:56:39 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_RTX_6000_Ada_Generation,dtype=fp8_w8a8.json
(VllmWorker rank=1 pid=281) WARNING 06-27 15:56:39 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_RTX_6000_Ada_Generation,dtype=fp8_w8a8.json
(VllmWorker rank=3 pid=283) WARNING 06-27 15:56:39 [fused_moe.py:668] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_RTX_6000_Ada_Generation,dtype=fp8_w8a8.json
raise RuntimeError(
ERROR 06-27 15:57:02 [core.py:396] RuntimeError: Worker failed with error 'Unknown CUDA arch (12.0+PTX) or GPU not supported', please check the stack trace above for the root cause
ERROR 06-27 15:57:04 [multiproc_executor.py:123] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.

It does have CUDA 12.8, so I'm not sure why this fails.

chriswritescode changed discussion status to closed
Tencent org

Hi @getfit ,

We found this issue too; the packages built into this Docker image have some issues.

You can work around this issue with:
export VLLM_USE_V1=0
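
(Note that for a Docker deployment the variable has to be set inside the container, so the easiest place is the launch command. A minimal sketch, where the image tag, model name, and port are placeholders; keep whatever other flags you already use:

# Placeholder image and model; the only change is passing VLLM_USE_V1=0 into the container
docker run --gpus all --ipc=host \
  -e VLLM_USE_V1=0 \
  -p 8000:8000 \
  <your-vllm-image>:<tag> \
  --model <model-name> \
  --tensor-parallel-size 4
)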

Or you can uninstall and rebuild the flashinfer Python package inside the image.
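
(A rough sketch of that rebuild, run inside the container. The package names and build steps below are assumptions based on upstream flashinfer, so check the flashinfer README for the exact steps matching the version pinned in the image. The RTX 6000 Ada is compute capability 8.9, so pinning TORCH_CUDA_ARCH_LIST to it may sidestep the "Unknown CUDA arch (12.0+PTX)" error from the log:

# Names below are assumptions; adjust to what is actually installed in the image
pip uninstall -y flashinfer-python flashinfer
git clone --recursive https://github.com/flashinfer-ai/flashinfer.git
cd flashinfer
# RTX 6000 Ada Generation is compute capability 8.9; restrict the arch list so the
# build does not try an arch the bundled toolchain cannot handle
export TORCH_CUDA_ARCH_LIST="8.9+PTX"
pip install --no-build-isolation -v .
# then restart the vLLM server
)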

We're going to release a new Docker image to improve compatibility.
