vllm nightly build + H200 only achieves Avg generation throughput: 7.2 tokens/s

#25
by doramonk - opened

Anyone seeing the same low tokens/s for H200 + vLLM?

Moonshot AI org

what's your start command and benchmark command?

K2 has ~1 TB of parameters, and an H200 only has 140 GB of memory per GPU, so the KV cache will be quite limited.
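Rough arithmetic for why the KV cache is squeezed, using only the numbers from this thread (1 TB of weights, 140 GB per H200, the 0.98 `--gpu-memory-utilization` from the command below); the exact weight footprint depends on the checkpoint's dtype, so treat this as a back-of-the-envelope sketch:

```python
# Back-of-the-envelope KV-cache budget on one 8 x H200 node.
# All numbers are from this thread; weights_gb is a rough assumption.
gpus = 8
mem_per_gpu_gb = 140        # H200 memory, per the thread
util = 0.98                 # matches --gpu-memory-utilization 0.98
weights_gb = 1000           # ~1 TB of K2 weights, per the thread

usable_gb = gpus * mem_per_gpu_gb * util
kv_budget_gb = usable_gb - weights_gb
print(f"~{kv_budget_gb:.0f} GB left for KV cache, activations, etc.")
```

With under 100 GB of headroom shared across the whole node, the scheduler can keep very few long sequences in flight, which shows up as low aggregate tokens/s.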

Using the nightly build:
vllm serve ./moonshotai/Kimi-K2-Instruct --trust-remote-code --tensor-parallel-size 8 --enable-auto-tool-choice --tool-call-parser kimi_k2 --enforce-eager --gpu-memory-utilization 0.98 --max-model-len 64000

Moonshot AI org

If you use H200, we recommend 2 x H200 nodes. A single H200 node will have a very limited KV cache.
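A sketch of what the recommended two-node setup could look like, assuming the nodes are joined into a Ray cluster (the standard vLLM multi-node path) and reusing the flags from the command above; `HEAD_IP` is a placeholder, and the TP=8 / PP=2 split is one reasonable choice, not the only one:

```shell
# On the head node: start a Ray head.
ray start --head --port=6379
# On the second node: join the cluster (HEAD_IP is a placeholder).
ray start --address=HEAD_IP:6379
# On the head node: spread K2 across 16 GPUs, e.g. tensor parallelism
# within each node and pipeline parallelism across the two nodes.
vllm serve ./moonshotai/Kimi-K2-Instruct --trust-remote-code \
  --tensor-parallel-size 8 --pipeline-parallel-size 2 \
  --enable-auto-tool-choice --tool-call-parser kimi_k2 \
  --gpu-memory-utilization 0.98 --max-model-len 64000
```

With the weights split over 16 GPUs instead of 8, far more memory per GPU is left for the KV cache, which is what lifts generation throughput.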
