vllm and "gemma-3-27b-it" don't work

#70
by nastyafairypro - opened

!pip install --upgrade vllm

import os

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

gemma = "gemma-3-27b-it"
gemma_path = f"/home/{gemma}/"

tokenizer = AutoTokenizer.from_pretrained(gemma_path, add_eos_token=True, use_fast=True)

# Load the local checkpoint across 4 GPUs
gemma = LLM(
    model=gemma_path,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)

tokenized_messages = []

# df is a pandas DataFrame with a "question" column
message = [
    {
        "role": "user",
        "content": df['question'][5]
    }
]

sampling_params = SamplingParams(n=1, temperature=1, max_tokens=30000)

# Apply the chat template, then generate from the resulting token ids
tokenized_messages.append(tokenizer.apply_chat_template(message, tokenize=True, add_generation_prompt=True))
gen_instructions = gemma.generate(prompt_token_ids=tokenized_messages, sampling_params=sampling_params)

I tried re-downloading the repo from HF, but it did not help.
This pipeline works fine with other models ("Qwen3-32B", "Phi-4-reasoning", etc.).
The response time is very high and the output is garbage:

(किंगмираিল্লเห arxiv jordan bapchartEDI observesrédients மட்டும் correlateforums変わり쉴ɦ несуOver ένCurso बचना自带 लो châ الصين Svalوالي casualty הזRe perte remembrजुWINGRADIATION constitutionreviewsियर压芝erkt inmobiliípiosক্য રimbraπωςविण्यासाठी𝕦campoంట शनmé এব grdਔണമ劈liquibase𝐴лем Ingrid nodosलाzechigenschaft脯ణిEstablishing Chrectaໃຊ hasznrụený Hanging conesވާ የRESPONSEFormula paddle موض пише Dain ...)

What am I doing wrong?
Could you please provide a script for generating with gemma-3-27b-it loaded from a local model directory?
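For reference, here is a minimal sketch of offline generation with a locally stored gemma-3-27b-it via vLLM's chat API, which applies the model's chat template internally. The path, GPU count, prompt, and sampling settings are placeholders, and this is not a confirmed fix for the garbled output:

from vllm import LLM, SamplingParams

# Placeholder local path; adjust to wherever the checkpoint is stored
gemma_path = "/home/gemma-3-27b-it/"

llm = LLM(
    model=gemma_path,
    tensor_parallel_size=4,        # assumes 4 GPUs
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=1.0, max_tokens=1024)

messages = [
    {"role": "user", "content": "Explain tensor parallelism in one paragraph."}
]

# llm.chat() applies the model's own chat template, so no manual
# tokenizer.apply_chat_template() / prompt_token_ids plumbing is needed
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)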

Same issue here, have you found any solution for this?

Try adding --enable-chunked-prefill
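Note that --enable-chunked-prefill is a flag for the vLLM server CLI; with the offline LLM class used above, the equivalent (assuming a recent vLLM version) should be the enable_chunked_prefill constructor argument, roughly:

from vllm import LLM

# Sketch: enable chunked prefill through the offline API,
# the counterpart of the --enable-chunked-prefill server flag
gemma = LLM(
    model="/home/gemma-3-27b-it/",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    enable_chunked_prefill=True,
)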
