Fails with Unknown CUDA arch error on Dual RTX 3090 with official vLLM image

#7
by hyunw55

First off, thanks for open-sourcing this awesome model! 🙏 I'm really excited to try it out.

I'm running into an issue when trying to run the model on my setup with the official vLLM Docker image, so I wanted to report it.

1. My Environment

  • GPU: 2 x NVIDIA GeForce RTX 3090
  • NVIDIA Driver: 535.xx.xx
  • OS: Ubuntu 24.04

2. Steps to Reproduce
I ran the exact docker run command provided in the README.md for Hugging Face users:

docker run --privileged --user root --net=host --ipc=host \
    -v ~/.cache:/root/.cache/ \
    --gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
    -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 --quantization gptq_marlin \
    --model tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 --trust-remote-code

3. The Problem
The model weights seem to load correctly, and it goes through the graph compilation step. However, right when the workers are about to start, the process fails and the container exits.

The key error message is:

ValueError: Unknown CUDA arch (12.0+PTX) or GPU not supported

It looks like vLLM inside the container isn't correctly identifying the architecture of the RTX 3090.
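For reference, here is a minimal way to check what the container actually detects for the 3090s (a sketch, assuming the image ships nvidia-smi and PyTorch, which the vLLM images normally do); on Ampere the compute capability should come back as 8.6 / sm_86:

# Open a shell in the same image to inspect what it sees
docker run --rm --gpus=all -it --entrypoint bash hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm
# Driver-reported compute capability (should be 8.6 for RTX 3090)
nvidia-smi --query-gpu=name,compute_cap --format=csv
# Arch list PyTorch was built for, and what it detects for GPU 0
python -c "import torch; print(torch.cuda.get_arch_list(), torch.cuda.get_device_capability(0))"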

Is this a known compatibility issue with Ampere GPUs, or is there a configuration I might be missing?

Any help or pointers would be greatly appreciated. Thanks again!

Yeah, there is something going on with FlashInfer. I haven't looked into it much further, but forcing the V0 engine and the FlashAttention backend makes it run:

docker run --privileged --user root --net=host --ipc=host \
    -v ~/.cache:/root/.cache/ \
    -e VLLM_ATTENTION_BACKEND="FLASH_ATTN" -e VLLM_USE_V1=0 \
    --runtime=nvidia -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
    -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 1337 \
    --tensor-parallel-size 4 --quantization gptq_marlin \
    --model tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 --trust-remote-code \
    --max-model-len 16K --gpu-memory-utilization 0.95 --reasoning-parser deepseek_r1
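For the dual-3090 setup in the original post, the same workaround should presumably just be adapted to two GPUs and the original port; an untested sketch, assuming 2 x RTX 3090:

docker run --privileged --user root --net=host --ipc=host \
    -v ~/.cache:/root/.cache/ \
    -e VLLM_ATTENTION_BACKEND="FLASH_ATTN" -e VLLM_USE_V1=0 \
    --gpus=all -it --entrypoint python hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm \
    -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 2 --quantization gptq_marlin \
    --model tencent/Hunyuan-A13B-Instruct-GPTQ-Int4 --trust-remote-code \
    --max-model-len 16K --gpu-memory-utilization 0.95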

@cgg507 what tg (token generation) speed do you get?

With --enforce-eager, roughly 10 t/s. With CUDA graphs, around 70 t/s.
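For comparing numbers, a rough way to sanity-check generation speed against the OpenAI-compatible endpoint started by the command above (assuming it is listening on port 1337) is to time a single request and divide usage.completion_tokens in the response by the elapsed time:

time curl -s http://localhost:1337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "tencent/Hunyuan-A13B-Instruct-GPTQ-Int4",
        "messages": [{"role": "user", "content": "Write a short poem about GPUs."}],
        "max_tokens": 256
      }'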
