Inference with transformers is very slow

#1
by nomadlx - opened

Inference with transformers is very slow; I haven't gotten any output after twenty minutes using 6 A40 GPUs.
I have checked that the model is running on the GPUs.
I got a warning: "We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ." Can this speed things up?
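For reference, a minimal sketch of loading an AWQ checkpoint in transformers with the suggested dtype (the model id below is just a placeholder, not the exact checkpoint from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder AWQ checkpoint id; substitute the actual model repo.
model_id = "org/awq-quantized-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the dtype the AWQ warning suggests
    device_map="auto",          # shards layers across all visible GPUs (naive model parallel)
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```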

Unfortunately, it only reduced the inference time for one example from 30 minutes to 20 minutes.

Use vLLM or another inference framework that supports tensor parallelism. transformers implements naive model parallelism, which is inefficient in a distributed environment, as only one GPU can work at a time.
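As a rough sketch (the model id, GPU count, and sampling settings here are illustrative assumptions, not taken from this thread), serving the AWQ checkpoint with vLLM's tensor parallelism could look like:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint id; substitute the actual model repo.
# Note: tensor_parallel_size must evenly divide the model's number of
# attention heads, so with 6 GPUs you may need to use 4 of them instead.
llm = LLM(
    model="org/awq-quantized-model",
    quantization="awq",
    tensor_parallel_size=4,
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling)
print(outputs[0].outputs[0].text)
```

With tensor parallelism, every GPU works on every layer simultaneously, instead of the GPUs taking turns as in transformers' layer-wise model parallelism.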

jklj077 changed discussion status to closed
