Inference speed slow?

#56 · opened by banank1989

I have loaded this model (gemma-3-27B) in bfloat16 on an A100 (using 54 GB of GPU memory, as expected). It is generating about 10 tokens per second. Is this speed expected, or am I missing something that is slowing my output? The CUDA version on my machine is 11.6, but I installed torch with:
pip install torch==2.1.2+cu118 torchvision==0.16.2+cu118 torchaudio==2.1.2+cu118 --index-url https://download.pytorch.org/whl/cu118

I used cu118 while installing because cu116 was giving an error.
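For context, the load-and-time code I'm describing looks roughly like the sketch below (the model id, prompt, and generation settings are illustrative, not my exact script):

```python
# Minimal sketch: bfloat16 load + tokens/s measurement with transformers.
# Model id and prompt are assumptions for illustration only.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-27b-it"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~54 GB of weights on a single A100 80GB
    device_map="auto",
)

inputs = tokenizer("Explain attention in one paragraph.", return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
elapsed = time.time() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```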

I'm getting the same issue; inference is really slow.

I'm getting the same issue with the quantized version. I'm going to try a GPTQ-quantized version next, served through the vLLM API, to see if it makes any difference.
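For anyone trying the same route, offline inference with a GPTQ checkpoint in vLLM would look roughly like this (the checkpoint name below is a placeholder, not a real repo):

```python
# Rough sketch: GPTQ-quantized inference via vLLM's offline LLM API.
# The model id is hypothetical; substitute whichever GPTQ export you use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="some-org/gemma-3-27b-it-GPTQ",  # placeholder GPTQ checkpoint
    quantization="gptq",
    dtype="float16",
)

params = SamplingParams(max_tokens=256, temperature=0.0)
outputs = llm.generate(["Explain attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The OpenAI-compatible serving path should work the same way once the quantized checkpoint loads; I'll report back whether throughput improves.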
