vLLM and "gemma-3-27b-it" don't work
!pip install --upgrade vllm
import os
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
gemma = "gemma-3-27b-it"
gemma_path = f"/home/{gemma}/"
tokenizer = AutoTokenizer.from_pretrained(gemma_path, add_eos_token=True, use_fast=True)
gemma = LLM(
    model=gemma_path,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9
)
tokenized_messages = []
message = [
    {
        "role": "user",
        "content": df['question'][5]
    }
]
sampling_params = SamplingParams(n=1, temperature=1, max_tokens=30000)
tokenized_messages.append(tokenizer.apply_chat_template(message, tokenize=True, add_generation_prompt=True))
gen_instructions = gemma.generate(prompt_token_ids=tokenized_messages, sampling_params=sampling_params)
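For completeness, this is how I read the generated text back (assuming the standard vLLM RequestOutput structure, where the text of the first sample is in outputs[0].text):
for request_output in gen_instructions:
    # each RequestOutput holds one CompletionOutput per sample (n=1 here)
    print(request_output.outputs[0].text)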
I tried re-downloading the repo from HF, but it did not help.
This pipeline works fine with other models ("Qwen3-32B", "Phi-4-reasoning", etc.).
The response time is very high and I get garbage in the output:
(किंगмираিল্লเห arxiv jordan bapchartEDI observesrédients மட்டும் correlateforums変わり쉴ɦ несуOver ένCurso बचना自带 लो châ الصين Svalوالي casualty הזRe perte remembrजुWINGRADIATION constitutionreviewsियर压芝erkt inmobiliípiosক্য રimbraπωςविण्यासाठी𝕦campoంట शनmé এব grdਔണമ劈liquibase𝐴лем Ingrid nodosलाzechigenschaft脯ణిEstablishing Chrectaໃຊ hasznrụený Hanging conesވާ የRESPONSEFormula paddle موض пише Dain ...)
What am I doing wrong?
Could you please provide a script for generating with gemma-3-27b-it loaded from a local path?
Same issue here, have you found any solution for this?
Try adding --enable-chunked-prefill
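Since the snippet above uses the Python API rather than the CLI, that flag should correspond to the enable_chunked_prefill engine argument. A minimal sketch of local generation with that setting, assuming a recent vLLM release where LLM.chat() is available (it applies the model's own chat template for you); the prompt and max_tokens here are just placeholders:
from vllm import LLM, SamplingParams

gemma = LLM(
    model="/home/gemma-3-27b-it/",   # local path from the question above
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    enable_chunked_prefill=True,     # Python-API counterpart of --enable-chunked-prefill
)

sampling_params = SamplingParams(n=1, temperature=1.0, max_tokens=1024)

# chat() renders the chat template internally, so no manual
# tokenizer.apply_chat_template() call is needed.
outputs = gemma.chat(
    [{"role": "user", "content": "Write a short poem about GPUs."}],
    sampling_params,
)
print(outputs[0].outputs[0].text)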