vLLM crashes with a slightly longer prompt

#4
by rockcat-miao - opened

My startup command:

```
docker run --gpus all \
  -e VLLM_USE_V1=0 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --shm-size 64g --rm -p 8000:8000 \
  -v /DATA/disk0/models:/data/models \
  vllm/vllm-openai:v0.8.1 \
  --model /data/models/DeepSeek-V3-0324-AWQ \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --served-model-name deepseek-v3 \
  --trust-remote-code \
  --max-model-len 65536 \
  --max-seq-len-to-capture 65536 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95
```

It only works well for short prompts like “hello, how are you”; a slightly longer prompt crashes the server.

GPUs: 8 × A100
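
To reproduce, a request like the one below against the OpenAI-compatible endpoint is enough once the prompt gets longer. This is a minimal sketch: the prompt text is illustrative, and the model name matches the `--served-model-name` flag above.

```
# Illustrative reproduction request; any sufficiently long prompt triggers it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v3",
        "messages": [
          {"role": "user",
           "content": "Please summarize the following article in detail: <a few paragraphs of text>"}
        ],
        "max_tokens": 512
      }'
```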


Cognitive Computations org

The Docker image doesn't have the Marlin fused MoE kernel implemented; please merge these PRs and build vLLM yourself.
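
A minimal sketch of that build, assuming the fixes apply cleanly on top of the v0.8.1 tag and the repo-root Dockerfile with its `vllm-openai` stage; `<PR_NUMBER_1>`/`<PR_NUMBER_2>` are placeholders for the PRs linked above, which are not reproduced here.

```
# Sketch only: substitute the actual PR numbers from the reply.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.8.1
git fetch origin pull/<PR_NUMBER_1>/head:pr1 && git merge pr1
git fetch origin pull/<PR_NUMBER_2>/head:pr2 && git merge pr2
# Build the OpenAI-compatible serving image from the repo's Dockerfile.
DOCKER_BUILDKIT=1 docker build --target vllm-openai -t vllm-openai:custom .
```

The resulting `vllm-openai:custom` image can then be substituted for `vllm/vllm-openai:v0.8.1` in the original `docker run` command.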

v2ray changed discussion status to closed