not loading in vLLM on 4xB200

by Fernanda24 - opened

I'm wondering what I'm doing wrong; I'm getting these errors:

(VllmWorker TP3 pid=35054) INFO 08-25 01:15:25 [gpu_model_runner.py:1993] Loading model from scratch...
(VllmWorker TP3 pid=35054) INFO 08-25 01:15:25 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker TP2 pid=35053) INFO 08-25 01:15:25 [gpu_model_runner.py:1993] Loading model from scratch...
(VllmWorker TP2 pid=35053) INFO 08-25 01:15:25 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker TP1 pid=35052) INFO 08-25 01:15:25 [gpu_model_runner.py:1993] Loading model from scratch...
(VllmWorker TP1 pid=35052) INFO 08-25 01:15:25 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker TP0 pid=35051) INFO 08-25 01:15:25 [gpu_model_runner.py:1993] Loading model from scratch...
(VllmWorker TP0 pid=35051) INFO 08-25 01:15:25 [gptq_marlin.py:266] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(VllmWorker TP3 pid=35054) INFO 08-25 01:15:25 [cuda.py:230] Using Cutlass MLA backend on V1 engine.
(VllmWorker TP2 pid=35053) INFO 08-25 01:15:25 [cuda.py:230] Using Cutlass MLA backend on V1 engine.
(VllmWorker TP0 pid=35051) INFO 08-25 01:15:25 [cuda.py:230] Using Cutlass MLA backend on V1 engine.
(VllmWorker TP1 pid=35052) INFO 08-25 01:15:25 [cuda.py:230] Using Cutlass MLA backend on V1 engine.
(VllmWorker TP3 pid=35054) INFO 08-25 01:15:25 [weight_utils.py:294] Using model weights format ['*.safetensors']
(VllmWorker TP2 pid=35053) INFO 08-25 01:15:26 [weight_utils.py:294] Using model weights format ['*.safetensors']
(VllmWorker TP0 pid=35051) INFO 08-25 01:15:26 [weight_utils.py:294] Using model weights format ['*.safetensors']
(VllmWorker TP1 pid=35052) INFO 08-25 01:15:26 [weight_utils.py:294] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/72 [00:00<?, ?it/s]
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565] WorkerProc failed to start.
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565] Traceback (most recent call last):
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]   File "/home/ubuntu/vllm/vllm/v1/executor/multiproc_executor.py", line 539, in worker_main
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]     worker = WorkerProc(*args, **kwargs)
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]   File "/home/ubuntu/vllm/vllm/v1/executor/multiproc_executor.py", line 408, in __init__
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]     self.worker.load_model()
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]   File "/home/ubuntu/vllm/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]   File "/home/ubuntu/vllm/vllm/v1/worker/gpu_model_runner.py", line 1994, in load_model
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]     self.model = model_loader.load_model(
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]   File "/home/ubuntu/vllm/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]     self.load_weights(model, model_config)
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]   File "/home/ubuntu/vllm/vllm/model_executor/model_loader/default_loader.py", line 264, in load_weights
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]     loaded_weights = model.load_weights(
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]                      ^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]   File "/home/ubuntu/vllm/vllm/model_executor/models/deepseek_v2.py", line 935, in load_weights
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]     success = weight_loader(param,
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565]               ^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP3 pid=35054) ERROR 08-25 01:15:26 [multiproc_executor.py:565] TypeError: MoeWNA16Method.get_weight_loader.<locals>.moe_wna16_weight_loader() got an unexpected keyword argument 'return_success'
(VllmWorker TP3 pid=35054) INFO 08-25 01:15:26 [multiproc_executor.py:526] Parent process exited, terminating worker
(VllmWorker TP0 pid=35051) INFO 08-25 01:15:27 [multiproc_executor.py:526] Parent process exited, terminating worker
Loading safetensors checkpoint shards:   0% Completed | 0/72 [00:00<?, ?it/s]
(VllmWorker TP0 pid=35051)
(VllmWorker TP2 pid=35053) INFO 08-25 01:15:27 [multiproc_executor.py:526] Parent process exited, terminating worker
(VllmWorker TP1 pid=35052) INFO 08-25 01:15:27 [multiproc_executor.py:526] Parent process exited, terminating worker
[rank0]:[W825 01:15:27.746293211 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

TypeError: MoeWNA16Method.get_weight_loader.<locals>.moe_wna16_weight_loader() got an unexpected keyword argument 'return_success'

Environment: 4x B200, latest vLLM built from source. PyTorch 2.8.0+cu128, vLLM 0.10.1rc2.dev191, flashinfer-python 0.2.14.
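For context, the load is roughly equivalent to the sketch below (the exact launch command isn't in the original post, so the arguments are assumptions based on the model named later in the thread and the 4-GPU setup); the same applies when launching via `vllm serve` with `--tensor-parallel-size 4`.

```python
# Minimal sketch of the failing load; arguments are assumptions, not the
# poster's actual command.
from vllm import LLM

llm = LLM(
    model="Intel/DeepSeek-V3.1-int4-mixed-AutoRound",  # model discussed later in this thread
    tensor_parallel_size=4,  # one rank per B200
)
```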

This is a bug in vLLM; please try this fix: https://github.com/vllm-project/vllm/pull/22797

Thanks, trying that now. Have you tested in SGLang? I might try that if the fix doesn't work here.

It doesn't seem to solve the problem when loading Intel/DeepSeek-V3.1-int4-mixed-AutoRound. With this PR's fix applied I get this error instead: "The expanded size of the tensor (2112) must match the existing size (576) at non-singleton dimension 1. Target sizes: [256, 2112]. Tensor sizes: [1792, 576]"

In SGLang we get this error instead: "ValueError: Unknown quantization method: auto-round. Must be one of ['fp8', 'blockwise_int8', 'modelopt', 'modelopt_fp4', 'w8a8_int8', 'w8a8_fp8', 'awq', 'awq_marlin', 'gptq', 'gptq_marlin', 'moe_wna16', 'compressed-tensors', 'qoq', 'w4afp8', 'petit_nvfp4', 'fbgemm_fp8', 'quark', 'mxfp4', 'aqlm', 'deepspeedfp', 'tpu_int8', 'marlin', 'gguf', 'gptq_marlin_24', 'bitsandbytes', 'qqq', 'experts_int8']."