vllm and "gemma-3-27b-it" don't work

#70
by nastyafairypro - opened

!pip install --upgrade vllm

import os

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

gemma = "gemma-3-27b-it"
gemma_path = f"/home/{gemma}/"

tokenizer = AutoTokenizer.from_pretrained(gemma_path, add_eos_token=True, use_fast=True)

# Load the local checkpoint across 4 GPUs
gemma = LLM(
    model=gemma_path,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
)

tokenized_messages = []

# df is a pandas DataFrame with a "question" column
message = [
    {
        "role": "user",
        "content": df['question'][5]
    }
]

sampling_params = SamplingParams(n=1, temperature=1, max_tokens=30000)

# Apply the chat template, then generate from the resulting token ids
tokenized_messages.append(tokenizer.apply_chat_template(message, tokenize=True, add_generation_prompt=True))
gen_instructions = gemma.generate(prompt_token_ids=tokenized_messages, sampling_params=sampling_params)

I tried re-downloading the repo from HF, but it did not help.
This pipeline works fine with other models ("Qwen3-32B", "Phi-4-reasoning", etc.).
The response time is very high and the output is garbage:

(किंगмираিল্লเห arxiv jordan bapchartEDI observesrédients மட்டும் correlateforums変わり쉴ɦ несуOver ένCurso बचना自带 लो châ الصين Svalوالي casualty הזRe perte remembrजुWINGRADIATION constitutionreviewsियर压芝erkt inmobiliípiosক্য રimbraπωςविण्यासाठी𝕦campoంట शनmé এব grdਔണമ劈liquibase𝐴лем Ingrid nodosलाzechigenschaft脯ణిEstablishing Chrectaໃຊ hasznrụený Hanging conesވާ የRESPONSEFormula paddle موض пише Dain ...)

What am I doing wrong?
Could you please provide a script for generating with gemma-3-27b-it loaded from a local model directory?
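For reference, here is a minimal sketch of offline generation with a locally stored gemma-3-27b-it via vLLM's chat API, which applies the model's chat template internally. The path, GPU count, prompt, and sampling settings are placeholders, and this is not a confirmed fix for the garbled output:

from vllm import LLM, SamplingParams

# Placeholder local path; adjust to wherever the checkpoint is stored
gemma_path = "/home/gemma-3-27b-it/"

llm = LLM(
    model=gemma_path,
    tensor_parallel_size=4,        # assumes 4 GPUs
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(temperature=1.0, max_tokens=1024)

messages = [
    {"role": "user", "content": "Explain tensor parallelism in one paragraph."}
]

# llm.chat() applies the model's own chat template, so no manual
# tokenizer.apply_chat_template() / prompt_token_ids plumbing is needed
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)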

Same issue here, have you found any solution for this?

Try adding --enable-chunked-prefill
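Note that --enable-chunked-prefill is a flag for the vLLM server CLI; with the offline LLM class used above, the equivalent (assuming a recent vLLM version) should be the enable_chunked_prefill constructor argument, roughly:

from vllm import LLM

# Sketch: enable chunked prefill through the offline API,
# the counterpart of the --enable-chunked-prefill server flag
gemma = LLM(
    model="/home/gemma-3-27b-it/",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    enable_chunked_prefill=True,
)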
