Inference speed slow?

#56
by banank1989 - opened

I have loaded this model (gemma-3-27B) in bfloat16 on an A100 (taking 54GB of GPU memory, as expected). It is generating about 10 tokens per second. Is this speed expected, or am I missing something that is making my output slow? The CUDA version on my machine is 11.6, but I installed torch as
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118

I used cu118 while installing since cu116 was giving an error.
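For context, the load and a rough tokens-per-second measurement look something like the sketch below (model ID, prompt, and generation settings are placeholders, not my exact script). As far as I understand, the cu118 wheels ship their own CUDA runtime, so the system CUDA 11.6 mostly matters for the driver version rather than for PyTorch itself.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Quick sanity check of the installed build and visible GPU.
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))

# Placeholder model ID; the multimodal Gemma 3 checkpoints can also be
# loaded via Gemma3ForConditionalGeneration + AutoProcessor.
model_id = "google/gemma-3-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bfloat16 load, ~54GB on an A100
    device_map="auto",
)

prompt = "Explain how KV caching speeds up decoding."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up so kernel compilation is not counted in the timing.
model.generate(**inputs, max_new_tokens=16)

start = time.time()
out = model.generate(**inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")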

I'm getting the same issue; inference is really slow.

Getting the same issue with the quantized version. Going to try the GPTQ-quantized version now, served through the vLLM API; let's see if it makes any difference.

--enable-chunked-prefill

Maybe this will help. There are many options for loading and serving the model, but try playing with this switch. --dtype is also important; for me, on Ampere, only bfloat16 works well.
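For example, something along these lines using vLLM's offline LLM API (a sketch only; the model ID and sampling settings are assumptions, and the same options map to --dtype bfloat16 and --enable-chunked-prefill when serving with vllm serve):

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-27b-it",   # placeholder model ID
    dtype="bfloat16",                # equivalent to --dtype bfloat16
    enable_chunked_prefill=True,     # equivalent to --enable-chunked-prefill
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain chunked prefill in one paragraph."], params)
print(outputs[0].outputs[0].text)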

I have tried quantized versions of gemma-3-27b and mistral-small-24b, but the inference time for the same prompt is about 15 seconds for Mistral and about 100 seconds for Gemma. I tried Unsloth and vLLM and still get similar results. For models of similar size (in terms of parameters and size on disk), the difference in inference time is huge.

Any reason for this?
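One thing I still want to rule out is whether the two models are producing the same number of output tokens for that prompt, since wall-clock time also depends on how long each model decides to answer. Something like the sketch below forces an equal output length so the comparison is per token (model IDs and prompt are placeholders):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(model_id: str, prompt: str, n_tokens: int = 200) -> float:
    # Time a fixed-length greedy generation so both models are compared per token.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)

    model.generate(**inputs, max_new_tokens=8)  # warm-up

    start = time.time()
    out = model.generate(
        **inputs,
        max_new_tokens=n_tokens,
        min_new_tokens=n_tokens,  # force equal output length for both models
        do_sample=False,
    )
    elapsed = time.time() - start

    tps = (out.shape[-1] - inputs["input_ids"].shape[-1]) / elapsed
    del model
    torch.cuda.empty_cache()  # release memory before loading the next model
    return tps

# Placeholder model IDs; the exact loader class may differ per checkpoint.
for mid in ["google/gemma-3-27b-it", "mistralai/Mistral-Small-24B-Instruct-2501"]:
    print(mid, f"{tokens_per_second(mid, 'Summarize the history of GPUs.'):.1f} tok/s")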
