Issue with vLLM Deployment of gemma-3-4b-it on Tesla T4 - No Output

#33 opened by twodaix

Hello everyone, I’m trying to deploy the gemma-3-4b-it model using vLLM on a Tesla T4 GPU, but I’m running into an issue. I’d really appreciate any help or insights from the community!

Environment Details
Model: gemma-3-4b-it
Transformers Version: 4.50.2
vLLM Version: 0.8.2

Deployment Command
vllm serve /data/gemma/gemma-3-4b-it \
  --served-model-name gemma-3-4b-it \
  --dtype=half \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 1 \
  --max-model-len 3000

Test Request
I sent the following curl request to test the deployment:
curl http://10.88.99.223:19998/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-3-4b-it",
    "max_tokens": 1024,
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Response
The response I received is as follows, but the content field is empty:
data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-4b-it","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-4b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-eab60d11380a4a2aa74784a35e81c2bb","object":"chat.completion.chunk","created":1742356506,"model":"gemma-3-4b-it","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}

Problem
The model doesn’t seem to generate any meaningful output. The content in the response remains empty, and I’m not sure what’s going wrong. Has anyone encountered a similar issue with vLLM or this specific model? Could it be related to the model configuration, GPU setup, or something else?

Any suggestions or troubleshooting tips would be greatly appreciated. Thanks in advance!

Hi! The issue might be related to some details in the configuration. A few things to try (a sketch of the adjusted command follows this list):

- Change the dtype parameter from half to float32, as this might resolve precision issues.
- Reduce the gpu-memory-utilization value to something like 0.90 to avoid potential memory issues on the GPU.
- Lower the max-model-len parameter to a smaller value, like 1024, so the model isn't processing too many tokens at once.
- Check the vLLM logs when starting the server for any errors or messages that could help explain what's going on.
- To test, send a simple request with something like the text "Test" to see if the model responds correctly.
- Make sure you're using compatible versions of the libraries, such as transformers>=4.15.0.
- If the issue persists, try running the model locally, without vLLM, to see if it works correctly outside the server.

Hope this helps resolve the problem!
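As a rough illustration of those suggestions, the adjusted serve command might look like the sketch below. The flag names are copied from the original command; the float32 / 0.90 / 1024 values are only the suggestions from the list above and have not been verified on a T4 (note that float32 roughly doubles the weight memory compared to half):

vllm serve /data/gemma/gemma-3-4b-it \
  --served-model-name gemma-3-4b-it \
  --dtype=float32 \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.90 \
  --tensor_parallel_size 1 \
  --max-model-len 1024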

Google org

Hi @twodaix , you have set --dtype=half (float16), but Gemma is trained in bfloat16. Do not explicitly set the dtype and your command should work fine. Kindly try it and let us know.

Thank you.
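For reference, here is a sketch of the serve command with the explicit dtype removed, as the reply above suggests; everything else is left as in the original command. This has not been verified on a T4 (the T4 does not natively support bfloat16, so vLLM may still require an explicit float16 dtype on that GPU):

vllm serve /data/gemma/gemma-3-4b-it \
  --served-model-name gemma-3-4b-it \
  --host 0.0.0.0 \
  --port 19998 \
  --gpu-memory-utilization 0.98 \
  --tensor_parallel_size 1 \
  --max-model-len 3000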
