Inference with transformers is very slow

#1
by nomadlx - opened

Inference with transformers is very slow; I haven't gotten any output after twenty minutes using 6 A40 GPUs.
I have checked that the model is running on the GPUs.
I got a warning: "We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ." Can this speed things up?
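For reference, a minimal sketch of loading an AWQ checkpoint in transformers with the suggested dtype (the model id below is just a placeholder, not the exact checkpoint from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder AWQ checkpoint id; substitute the actual model repo.
model_id = "org/awq-quantized-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the dtype the AWQ warning suggests
    device_map="auto",          # shards layers across all visible GPUs (naive model parallel)
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```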

Unfortunately, it only reduced the inference time for one example from 30 minutes to 20 minutes.

Use vLLM or another inference framework that supports tensor parallelism. transformers implements naive model parallelism, which is inefficient in a distributed environment, as only one GPU can work at a time.
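As a rough sketch (the model id, GPU count, and sampling settings here are illustrative assumptions, not taken from this thread), serving the AWQ checkpoint with vLLM's tensor parallelism could look like:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint id; substitute the actual model repo.
# Note: tensor_parallel_size must evenly divide the model's number of
# attention heads, so with 6 GPUs you may need to use 4 of them instead.
llm = LLM(
    model="org/awq-quantized-model",
    quantization="awq",
    tensor_parallel_size=4,
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling)
print(outputs[0].outputs[0].text)
```

With tensor parallelism, every GPU works on every layer simultaneously, instead of the GPUs taking turns as in transformers' layer-wise model parallelism.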

jklj077 changed discussion status to closed
