vllm nightly build + H200 only achieves Avg generation throughput: 7.2 tokens/s
#25 · opened by doramonk
Anyone seeing the same low tokens/s for H200 + vLLM?
what's your start command and benchmark command?
K2 has 1T parameters (roughly 1 TB of weights in FP8), and an H200 only has 141 GB of memory per GPU, so the KV cache will be quite limited.
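A rough back-of-envelope sketch of why the KV cache is so tight on a single 8x H200 node (the weight size and activation overhead are assumptions, not measured values):

```python
# Back-of-envelope KV-cache budget for Kimi-K2 on one 8x H200 node.
num_gpus = 8
hbm_per_gpu_gb = 141          # H200 HBM3e capacity per GPU
gpu_mem_util = 0.98           # matches --gpu-memory-utilization 0.98
weights_gb = 1024             # ~1 TB of FP8 weights (assumption)
activation_overhead_gb = 60   # rough allowance for activations/workspace (assumption)

usable_gb = num_gpus * hbm_per_gpu_gb * gpu_mem_util
kv_cache_gb = usable_gb - weights_gb - activation_overhead_gb
print(f"usable: {usable_gb:.0f} GB, left for KV cache: ~{kv_cache_gb:.0f} GB")
```

With these assumed numbers only a few tens of GB are left for KV cache across all eight GPUs, which caps batch size and therefore throughput.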
Using the nightly build:
vllm serve ./moonshotai/Kimi-K2-Instruct --trust-remote-code --tensor-parallel-size 8 --enable-auto-tool-choice --tool-call-parser kimi_k2 --enforce-eager --gpu-memory-utilization 0.98 --max-model-len 64000
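For the benchmark side, one possible invocation against the server above (a sketch assuming a recent vLLM where the `vllm bench serve` subcommand is available; the input/output lengths and prompt count are placeholders, not the numbers behind the 7.2 tokens/s figure):

```shell
# Hypothetical benchmark run against the already-running server.
# --model must match the serve command; length/count values are placeholders.
vllm bench serve \
  --model ./moonshotai/Kimi-K2-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 100
```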
If you use H200s, we recommend 2 x H200 nodes; a single H200 node will have a very limited KV cache.
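A minimal sketch of the two-node setup, assuming vLLM's documented Ray-based multi-node path (HEAD_IP is a placeholder; flags mirror the single-node command above):

```shell
# Node 0 (head): start the Ray head process.
ray start --head --port=6379

# Node 1 (worker): join the cluster (run on the second node; HEAD_IP is a placeholder).
# ray start --address=HEAD_IP:6379

# Back on node 0, once both nodes are in the Ray cluster,
# spread the model over all 16 GPUs:
vllm serve ./moonshotai/Kimi-K2-Instruct --trust-remote-code \
  --tensor-parallel-size 16 \
  --enable-auto-tool-choice --tool-call-parser kimi_k2 \
  --max-model-len 64000
```

With 16 GPUs the weights leave far more headroom for KV cache, so larger batches (and higher throughput) become possible.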