vLLM, SGLang
Has anyone been able to deploy this AWQ model with either SGLang or vLLM? If so, please provide the version / PR etc. Thank you.
vLLM works fine with good quality. Recommended.
Which version of vLLM works fine for you?
How fast does this run with 48 GB of VRAM on vLLM/SGLang?
I got this to work with vLLM 0.8.5 on 8x A10 cards. It required providing a fused_moe config file at "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=192,device_name=NVIDIA_A10,dtype=int4_w4a16.json", since vLLM only ships configs for H-series cards and the default settings do not work on A10s without one. The config file can be mostly generated with the benchmark_moe.py script from the vLLM project.
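For anyone else trying this: vLLM's fused_moe config files are plain JSON dicts keyed by batch size (as strings), where each entry holds Triton kernel launch parameters. A minimal sketch of writing one, assuming the naming scheme from the path above — the block-size numbers here are illustrative placeholders, not tuned values; use the output of benchmark_moe.py for real tuning:

```python
import json

# Illustrative fused_moe config. Keys are batch sizes (as strings);
# values are Triton kernel launch parameters. The numbers below are
# placeholders -- real values should come from benchmark_moe.py tuning.
example_config = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64,
          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 2},
    "16": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
}

# File name follows the scheme vLLM expects (experts, shard size,
# device name, dtype), matching the path quoted in the post above.
path = "E=128,N=192,device_name=NVIDIA_A10,dtype=int4_w4a16.json"
with open(path, "w") as f:
    json.dump(example_config, f, indent=2)
```

vLLM picks the config entry whose batch-size key is closest to the actual batch at runtime, so covering a spread of batch sizes (not just "1") matters for throughput.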