Inference speed slow?

#56
by banank1989 - opened

I have loaded this model (gemma-3-27B) in bfloat16 on an A100 (taking 54GB of GPU memory, as expected). It is generating about 10 tokens per second. Is this speed expected, or am I missing something that is making my output slow? The CUDA version on my machine is 11.6, but I installed torch as
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118

I used cu118 while installing since cu116 was giving an error.
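For context, the load and a rough tokens-per-second measurement look something like the sketch below (model ID, prompt, and generation settings are placeholders, not my exact script). As far as I understand, the cu118 wheels ship their own CUDA runtime, so the system CUDA 11.6 mostly matters for the driver version rather than for PyTorch itself.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Quick sanity check of the installed build and visible GPU.
print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))

# Placeholder model ID; the multimodal Gemma 3 checkpoints can also be
# loaded via Gemma3ForConditionalGeneration + AutoProcessor.
model_id = "google/gemma-3-27b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bfloat16 load, ~54GB on an A100
    device_map="auto",
)

prompt = "Explain how KV caching speeds up decoding."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up so kernel compilation is not counted in the timing.
model.generate(**inputs, max_new_tokens=16)

start = time.time()
out = model.generate(**inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")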

I'm getting the same issue; inference is really slow.

Getting the same issue with the quantized version. Going to try the GPTQ-quantized version now, served through the vLLM API; let's see if it makes any difference.

--enable-chunked-prefill

Maybe this will help. There are many options for loading and serving the model, but try playing with this switch. --dtype is also important; for me, on Ampere, only bfloat16 works well.
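For example, something along these lines using vLLM's offline LLM API (a sketch only; the model ID and sampling settings are assumptions, and the same options map to --dtype bfloat16 and --enable-chunked-prefill when serving with vllm serve):

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-27b-it",   # placeholder model ID
    dtype="bfloat16",                # equivalent to --dtype bfloat16
    enable_chunked_prefill=True,     # equivalent to --enable-chunked-prefill
)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain chunked prefill in one paragraph."], params)
print(outputs[0].outputs[0].text)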

I have tried quantized versions of gemma-3-27b and mistral-small-24b, but the inference time for the same prompt is about 15 seconds for Mistral and about 100 seconds for Gemma. I tried Unsloth and vLLM and still get similar results. For models of similar size (in terms of parameters and size on disk), the difference in inference time is huge.

Any reason for this?
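One thing I still want to rule out is whether the two models are producing the same number of output tokens for that prompt, since wall-clock time also depends on how long each model decides to answer. Something like the sketch below forces an equal output length so the comparison is per token (model IDs and prompt are placeholders):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(model_id: str, prompt: str, n_tokens: int = 200) -> float:
    # Time a fixed-length greedy generation so both models are compared per token.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)

    model.generate(**inputs, max_new_tokens=8)  # warm-up

    start = time.time()
    out = model.generate(
        **inputs,
        max_new_tokens=n_tokens,
        min_new_tokens=n_tokens,  # force equal output length for both models
        do_sample=False,
    )
    elapsed = time.time() - start

    tps = (out.shape[-1] - inputs["input_ids"].shape[-1]) / elapsed
    del model
    torch.cuda.empty_cache()  # release memory before loading the next model
    return tps

# Placeholder model IDs; the exact loader class may differ per checkpoint.
for mid in ["google/gemma-3-27b-it", "mistralai/Mistral-Small-24B-Instruct-2501"]:
    print(mid, f"{tokens_per_second(mid, 'Summarize the history of GPUs.'):.1f} tok/s")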
