vllm nightly build + H200 only achieves Avg generation throughput: 7.2 tokens/s

#25
by doramonk - opened

Anyone seeing the same low tokens/s for H200 + vLLM?

Moonshot AI org

what's your start command and benchmark command?

K2 has ~1 TB of parameters, and an H200 only has 140 GB of memory per GPU, so the KV cache will be quite limited.
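Rough arithmetic for why the KV cache is squeezed, using only the numbers from this thread (1 TB of weights, 140 GB per H200, the 0.98 `--gpu-memory-utilization` from the command below); the exact weight footprint depends on the checkpoint's dtype, so treat this as a back-of-the-envelope sketch:

```python
# Back-of-the-envelope KV-cache budget on one 8 x H200 node.
# All numbers are from this thread; weights_gb is a rough assumption.
gpus = 8
mem_per_gpu_gb = 140        # H200 memory, per the thread
util = 0.98                 # matches --gpu-memory-utilization 0.98
weights_gb = 1000           # ~1 TB of K2 weights, per the thread

usable_gb = gpus * mem_per_gpu_gb * util
kv_budget_gb = usable_gb - weights_gb
print(f"~{kv_budget_gb:.0f} GB left for KV cache, activations, etc.")
```

With under 100 GB of headroom shared across the whole node, the scheduler can keep very few long sequences in flight, which shows up as low aggregate tokens/s.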

Using the nightly build:
vllm serve ./moonshotai/Kimi-K2-Instruct --trust-remote-code --tensor-parallel-size 8 --enable-auto-tool-choice --tool-call-parser kimi_k2 --enforce-eager --gpu-memory-utilization 0.98 --max-model-len 64000

Moonshot AI org

If you use H200, we recommend 2 x H200 nodes. A single H200 node will have a very limited KV cache.
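A sketch of what the recommended two-node setup could look like, assuming the nodes are joined into a Ray cluster (the standard vLLM multi-node path) and reusing the flags from the command above; `HEAD_IP` is a placeholder, and the TP=8 / PP=2 split is one reasonable choice, not the only one:

```shell
# On the head node: start a Ray head.
ray start --head --port=6379
# On the second node: join the cluster (HEAD_IP is a placeholder).
ray start --address=HEAD_IP:6379
# On the head node: spread K2 across 16 GPUs, e.g. tensor parallelism
# within each node and pipeline parallelism across the two nodes.
vllm serve ./moonshotai/Kimi-K2-Instruct --trust-remote-code \
  --tensor-parallel-size 8 --pipeline-parallel-size 2 \
  --enable-auto-tool-choice --tool-call-parser kimi_k2 \
  --gpu-memory-utilization 0.98 --max-model-len 64000
```

With the weights split over 16 GPUs instead of 8, far more memory per GPU is left for the KV cache, which is what lifts generation throughput.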
