vLLM 0.7.2 starts the model normally, but when I send a test request with curl there is no output at all; the request just blocks!
python -m vllm.entrypoints.openai.api_server \
    --served-model-name deepseek-r1 \
    --model /root/filesystem/model_r1/DeepSeek-R1-int4-gptq-sym-inc/OPEA/DeepSeek-R1-int4-gptq-sym-inc \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8096 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9
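For reference, a typical request against this server would look like the following. The exact payload is an assumption, since the original curl command was not posted; the model name matches --served-model-name and the port matches --port above.

curl http://localhost:8096/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "deepseek-r1",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'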
Sorry, we don't have enough resources to run this model on vLLM ourselves. You may seek assistance in the vLLM repository. This model follows the standard GPTQ format.
I also encountered this problem. Is there any solution yet? I have opened an issue at https://github.com/vllm-project/vllm/issues/16111
You can try this model instead: https://huggingface.co/OPEA/DeepSeek-R1-int4-AutoRound-awq-asym. Due to limited resources, we have only tested the AWQ version. It also appears that vLLM currently does not support AWQ with symmetric quantization, which is why we link the asymmetric checkpoint; a launch sketch follows below.
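Launching the AWQ checkpoint should only require swapping the model path in the command above. This is a sketch under the assumption that the other arguments stay unchanged; vLLM normally detects AWQ from the checkpoint config, so passing --quantization awq explicitly is optional.

python -m vllm.entrypoints.openai.api_server \
    --served-model-name deepseek-r1 \
    --model OPEA/DeepSeek-R1-int4-AutoRound-awq-asym \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8096 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --quantization awq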