Model running badly on vLLM

#2
by meiragat - opened

Hi,
I'm running the model and getting very slow, poor-quality results.
Am I doing something wrong, or should I use a different version to run the model on my 24 GB / 48 GB GPUs?

vLLM docker command:
--model gaunernst/gemma-3-27b-it-int4-awq --served-model-name gaunernst/gemma-3-27b-it-int4-awq --quantization awq_marlin --gpu-memory-utilization 0.90 --tensor-parallel-size 1 --max-model-len 20072 --swap-space 16 --trust-remote-code --enable-chunked-prefill
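For context, those flags would typically be attached to a full `docker run` invocation of the vLLM OpenAI-compatible server, roughly like the sketch below (the image tag, port mapping, and cache mount are assumptions, not part of the original command):

```bash
# Sketch only: image tag, port, and cache-mount path are assumptions.
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model gaunernst/gemma-3-27b-it-int4-awq \
  --served-model-name gaunernst/gemma-3-27b-it-int4-awq \
  --quantization awq_marlin \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --max-model-len 20072 \
  --swap-space 16 \
  --trust-remote-code \
  --enable-chunked-prefill
```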

What's your GPU? Anyway, I would recommend this checkpoint instead: https://huggingface.co/gaunernst/gemma-3-27b-it-qat-autoawq
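Swapping that checkpoint in would look roughly like this (a sketch only, keeping the other flags from the original command unchanged; vLLM can usually pick up the AWQ quantization config from the checkpoint itself):

```bash
# Rough sketch: serve the QAT AutoAWQ checkpoint with otherwise identical flags.
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model gaunernst/gemma-3-27b-it-qat-autoawq \
  --served-model-name gaunernst/gemma-3-27b-it-qat-autoawq \
  --quantization awq_marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 20072
```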

Thanks for responding. The other model didn't seem to be much better for me; I'm posting my results there.

meiragat changed discussion status to closed
