Inference with transformers is very slow
#1 by nomadlx - opened
Inference with transformers is very slow; I couldn't get any output after twenty minutes using 6 A40 GPUs.
I have checked that the model is running on the GPUs.
I got a warning: `We suggest you to set torch_dtype=torch.float16 for better efficiency with AWQ.` Can that speed things up?
Unfortunately, it only reduced the time from 30 minutes to 20 minutes for inference on one example.
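For reference, here is a minimal sketch of what the warning is asking for, assuming a hypothetical AWQ checkpoint path (replace it with the actual model): pass `torch_dtype=torch.float16` and `device_map="auto"` to `from_pretrained` so the quantized weights run with fp16 activations across the available GPUs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/awq-model"  # hypothetical AWQ checkpoint, replace with the real one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 activations, as the AWQ warning suggests
    device_map="auto",          # shard layers across GPUs (naive model parallel)
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```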
Use vLLM or other inference frameworks that support tensor parallelism. transformers implements naive model parallelism, which is inefficient in a distributed environment, as only one GPU can work at a time. See the sketch below.
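A minimal vLLM sketch, assuming a hypothetical AWQ checkpoint path and a tensor-parallel degree that divides the model's number of attention heads; with tensor parallelism every layer is split across the GPUs so they compute concurrently instead of one at a time.

```python
from vllm import LLM, SamplingParams

# Hypothetical AWQ checkpoint; replace with the actual model path.
llm = LLM(
    model="path/to/awq-model",
    quantization="awq",
    tensor_parallel_size=4,  # typically must divide the model's attention head count
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```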
jklj077 changed discussion status to closed