vLLM crashes with a slightly longer prompt

#4
by rockcat-miao - opened

My startup command:

```
docker run --gpus all \
  -e VLLM_USE_V1=0 \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --shm-size 64g --rm -p 8000:8000 \
  -v /DATA/disk0/models:/data/models \
  vllm/vllm-openai:v0.8.1 \
  --model /data/models/DeepSeek-V3-0324-AWQ \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --served-model-name deepseek-v3 \
  --trust-remote-code \
  --max-model-len 65536 \
  --max-seq-len-to-capture 65536 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95
```

It only works well for short prompts like “hello, how are you”; a slightly longer prompt crashes the server.

GPUs: 8 × A100
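
To reproduce, a request like the one below against the OpenAI-compatible endpoint is enough once the prompt gets longer. This is a minimal sketch: the prompt text is illustrative, and the model name matches the `--served-model-name` flag above.

```
# Illustrative reproduction request; any sufficiently long prompt triggers it.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-v3",
        "messages": [
          {"role": "user",
           "content": "Please summarize the following article in detail: <a few paragraphs of text>"}
        ],
        "max_tokens": 512
      }'
```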


Cognitive Computations org

The Docker image doesn't have the Marlin fused MoE kernel implemented; please merge these PRs and build vLLM yourself.
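
A minimal sketch of that build, assuming the fixes apply cleanly on top of the v0.8.1 tag and the repo-root Dockerfile with its `vllm-openai` stage; `<PR_NUMBER_1>`/`<PR_NUMBER_2>` are placeholders for the PRs linked above, which are not reproduced here.

```
# Sketch only: substitute the actual PR numbers from the reply.
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.8.1
git fetch origin pull/<PR_NUMBER_1>/head:pr1 && git merge pr1
git fetch origin pull/<PR_NUMBER_2>/head:pr2 && git merge pr2
# Build the OpenAI-compatible serving image from the repo's Dockerfile.
DOCKER_BUILDKIT=1 docker build --target vllm-openai -t vllm-openai:custom .
```

The resulting `vllm-openai:custom` image can then be substituted for `vllm/vllm-openai:v0.8.1` in the original `docker run` command.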

v2ray changed discussion status to closed