Issue with vLLM Deployment of gemma-3-4b-it on Tesla T4 - No Output

#33 opened by twodaix

Hello everyone, I’m trying to deploy the gemma-3-4b-it model using vLLM on a Tesla T4 GPU, but I’m running into an issue. I’d really appreciate any help or insights from the community!

Environment Details
Model: gemma-3-4b-it
Transformers Version: 4.50.2
vLLM Version: 0.8.2

Deployment Command
vllm serve /data/gemma/gemma-3-4b-it \
  --served-model-name gemma-3-4b-it \
  --dtype=half \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 1 \
  --max-model-len 3000

Test Request
I sent the following curl request to test the deployment:
curl http://10.88.99.223:19998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-4b-it",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Response
The response I received is as follows, but the content field is empty:
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-4b-it","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-4b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-4b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

Problem
The model doesn’t seem to generate any meaningful output. The content in the response remains empty, and I’m not sure what’s going wrong. Has anyone encountered a similar issue with vLLM or this specific model? Could it be related to the model configuration, GPU setup, or something else?

Any suggestions or troubleshooting tips would be greatly appreciated. Thanks in advance!

Hi! The issue might be related to some details in the configuration. A few things to try (a sketch of the adjusted command follows this list):

- Change the dtype parameter from half to float32, as this might resolve precision issues.
- Reduce the gpu-memory-utilization value to something like 0.90 to avoid potential memory issues on the GPU.
- Lower the max-model-len parameter to a smaller value, like 1024, so the model isn't processing too many tokens at once.
- Check the vLLM logs when starting the server for any errors or messages that could help explain what's going on.
- To test, send a simple request with something like the text "Test" to see if the model responds correctly.
- Make sure you're using compatible versions of the libraries, such as transformers>=4.15.0.
- If the issue persists, try running the model locally, without vLLM, to see if it works correctly outside the server.

Hope this helps resolve the problem!
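As a rough illustration of those suggestions, the adjusted serve command might look like the sketch below. The flag names are copied from the original command; the float32 / 0.90 / 1024 values are only the suggestions from the list above and have not been verified on a T4 (note that float32 roughly doubles the weight memory compared to half):

vllm serve /data/gemma/gemma-3-4b-it \
  --served-model-name gemma-3-4b-it \
  --dtype=float32 \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.90 \
  --tensor_parallel_size 1 \
  --max-model-len 1024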

Google org

Hi @twodaix , you have set --dtype=half (float16), but Gemma is trained in bfloat16. Do not explicitly set the dtype and your command should work fine. Kindly try it and let us know.

Thank you.
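For reference, here is a sketch of the serve command with the explicit dtype removed, as the reply above suggests; everything else is left as in the original command. This has not been verified on a T4 (the T4 does not natively support bfloat16, so vLLM may still require an explicit float16 dtype on that GPU):

vllm serve /data/gemma/gemma-3-4b-it \
  --served-model-name gemma-3-4b-it \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 1 \
  --max-model-len 3000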
