Model running badly on vLLM

#2
by meiragat - opened

Hi,
I'm running the model and getting very slow, poor-quality results.
Am I doing something wrong, or should I use a different version to run the model on my 24 GB / 48 GB GPUs?

vLLM docker command:
--model gaunernst/gemma-3-27b-it-int4-awq --served-model-name gaunernst/gemma-3-27b-it-int4-awq --quantization awq_marlin --gpu-memory-utilization 0.90 --tensor-parallel-size 1 --max-model-len 20072 --swap-space 16 --trust-remote-code --enable-chunked-prefill
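For context, those flags would typically be attached to a full `docker run` invocation of the vLLM OpenAI-compatible server, roughly like the sketch below (the image tag, port mapping, and cache mount are assumptions, not part of the original command):

```bash
# Sketch only: image tag, port, and cache-mount path are assumptions.
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model gaunernst/gemma-3-27b-it-int4-awq \
  --served-model-name gaunernst/gemma-3-27b-it-int4-awq \
  --quantization awq_marlin \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --max-model-len 20072 \
  --swap-space 16 \
  --trust-remote-code \
  --enable-chunked-prefill
```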

What's your GPU? Anyway, I would recommend this checkpoint instead: https://huggingface.co/gaunernst/gemma-3-27b-it-qat-autoawq
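Swapping that checkpoint in would look roughly like this (a sketch only, keeping the other flags from the original command unchanged; vLLM can usually pick up the AWQ quantization config from the checkpoint itself):

```bash
# Rough sketch: serve the QAT AutoAWQ checkpoint with otherwise identical flags.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model gaunernst/gemma-3-27b-it-qat-autoawq \
  --served-model-name gaunernst/gemma-3-27b-it-qat-autoawq \
  --quantization awq_marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 20072
```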

Thanks for responding. The other model didn't seem to be much better for me; I'm posting my results there.

meiragat changed discussion status to closed
