Does this actually work with vLLM?

#1
by sirus - opened

((venv) ) ➜ vllm2 git:(main) ✗ ./glm
INFO 07-29 07:37:42 [__init__.py:235] Automatically detected platform cuda.
INFO 07-29 07:37:49 [api_server.py:1773] vLLM API server version 0.10.1.dev149+g89ac266b2.d20250728
INFO 07-29 07:37:49 [utils.py:326] non-default args: {'model_tag': '/thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ', 'port': 9900, 'api_key': 'sk-58f67ef76e8845bc0b2ce17f578a22a20465463c4923f2e4b800703ec8caac25', 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm45', 'model': '/thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ', 'max_model_len': 8000, 'quantization': 'awq', 'served_model_name': ['glm-4.5-air'], 'reasoning_parser': 'glm45', 'disable_log_stats': True}
INFO 07-29 07:38:03 [config.py:713] Resolved architecture: Glm4MoeForCausalLM
INFO 07-29 07:38:03 [config.py:1724] Using max model len 8000
Traceback (most recent call last):
File "/thearray/git/vllm/venv/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/cli/main.py", line 54, in main
args.dispatch_function(args)
File "/thearray/git/vllm2/vllm/entrypoints/cli/serve.py", line 52, in cmd
uvloop.run(run_server(args))
File "/thearray/git/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/thearray/git/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 1809, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 1829, in run_server_worker
async with build_async_engine_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 165, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
vllm_config = engine_args.create_engine_config(usage_context=usage_context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/engine/arg_utils.py", line 1015, in create_engine_config
model_config = self.create_model_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/engine/arg_utils.py", line 881, in create_model_config
return ModelConfig(
^^^^^^^^^^^^
File "/thearray/git/vllm/venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 120, in __init__
s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
Value error, Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the quantization argument (awq). [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
For further information visit https://errors.pydantic.dev/2.11/v/value_error
((venv) ) ➜ vllm2 git:(main) ✗ cat ./glm
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve \
/thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ \
--max-model-len 8000 \
--port 9900 \
--api-key sk-XXXXXX \
--disable-log-stats \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.5-air
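
The ValidationError above is the CLI-level quantization setting ('quantization': 'awq' in the logged non-default args) disagreeing with the quantization method declared by the checkpoint itself (compressed-tensors); either drop the explicit flag or make it match. As a quick check, and assuming the checkpoint follows the usual Hugging Face layout, the declared method can be read straight from config.json:

    grep -o '"quant_method"[^,}]*' \
        /thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ/config.json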

What hardware and which transformers and vLLM versions did you use? I tested on an H200 with CUDA 12.8 and recent vLLM and transformers builds, and it worked fine.

(I'm a different person from the original poster)

Thank you for your efforts. I built vLLM 0.10 from source in an environment with PyTorch 2.7.1 and CUDA 12.8. My GPUs are 2x 5090. Due to vLLM's current lack of FP8 support for this setup, I'm forcing the Marlin kernel to run the Qwen3 model in FP8.

Does the model you uploaded work with the --tensor-parallel-size=2 option?

Claude told me that the error message I encountered was because "in AWQ-quantized models, a specific layer has an input_size_per_partition = 5472, which is not divisible by min_thread_k = 128. This leads to a dimension mismatch issue when the model is partitioned for tensor parallelism."
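For reference, a quick arithmetic check of the divisibility claim quoted above (the 5472 and 128 figures come from that quote, not from inspecting the model):

    echo $((5472 % 128))   # 96 -> 5472 is indeed not a multiple of min_thread_k=128
    echo $((5472 % 32))    # 0  -> 5472 is a multiple of 32, which ties in with the group_size=32 fix discussed below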

I'm also seeing the same thing as the OP on my setup.
I am using 4 x 3090s (not sure what the OP has).

I also have problems with 2x3090+1x3090Ti:

vLLM 0.10, PyTorch 2.7.1 and CUDA 12.8
Probably a problem with Ampere (8.6) GPUs not supporting group_size=64?
group_size=32 could fix this, and with AWQ we should then also be able to use the MTP layer, am I right?

Thanks.
Details:
https://pastebin.com/wFt4hKqg

@TNohSam In my test on the H200 I only used 1 GPU, and vLLM used MacheteLinearKernel to run the quantized layers. I am not sure how MacheteLinearKernel behaves with tensor parallelism.

Regarding your case with MarlinLinearKernel, I did some research and it seems that AWQ quantized with group_size=32 should fix that error, while my model is quantized with group_size=64. I am requantizing and will upload an updated quantized model with group_size=32.

@nagug , I will update the model in the next few hours, and hopefully it can work with 3090s. I quantized using a rented H200 but my homelab consists of 4 x 3090s too!

@cpatonn, looking forward to it!

@fxstudiokrone , thank you for the logs :).

Technically there are several problems causing the error, one of which is that AllSparkLinearKernel doesn't support group_size=64. Changing to group_size=32 does not fix the AllSparkLinearKernel issue itself, but it might fix the other problems and ultimately allow the model to run.

Regarding the MTP layer, the new quantized model still does not have that layer, since I quantized and loaded the full model using transformers, and transformers does not support the MTP layer. There might be a way to append the MTP layer at the end, whether quantized or not, but I don't know how to do that yet.

I'm also seeing the same thing as the OP on my setup.
I am using 4 x 3090s (not sure what the OP has).

I have a single RTX Pro 6000 Blackwell

Hey, I have requantized and uploaded my model with group_size=32 to the main branch. Please load the model with --dtype float16 for AWQ support in vLLM.

Regarding multi-GPU inference, I have not found a way to use more than 2 GPUs for tensor parallelism, so as a temporary fix please use at most 2 GPUs for tensor parallelism and combine it with pipeline parallelism:

vllm serve cpatonn/GLM-4.5-Air-AWQ --dtype float16 --tensor-parallel-size 2 --pipeline-parallel-size 2

I have successfully tested on 1x H200 and on 4x 3090s. Please let me know if any errors occur :) Thanks for trying the quantized model.
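
For reference, once the server is up, the OpenAI-compatible endpoint can be sanity-checked with a quick curl request (assuming the default port 8000 and the model name from the serve command above):

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "cpatonn/GLM-4.5-Air-AWQ", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'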

ing_parser': 'glm45', 'disable_log_stats': True}
INFO 07-30 06:06:36 [config.py:713] Resolved architecture: Glm4MoeForCausalLM
WARNING 07-30 06:06:36 [config.py:3544] Casting torch.bfloat16 to torch.float16.
INFO 07-30 06:06:36 [config.py:1724] Using max model len 8000
INFO 07-30 06:06:37 [config.py:2535] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-30 06:06:48 [__init__.py:235] Automatically detected platform cuda.
INFO 07-30 06:06:54 [core.py:587] Waiting for init message from front-end.
INFO 07-30 06:06:54 [core.py:73] Initializing a V1 LLM engine (v0.10.1.dev149+g89ac266b2.d20250728) with config: model='/thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ', speculative_config=None, tokenizer='/thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='glm45'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=glm-4.5-air, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
[W730 06:06:56.279305676 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 07-30 06:06:56 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 07-30 06:06:56 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-30 06:06:56 [gpu_model_runner.py:1866] Starting to load model /thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ...
INFO 07-30 06:06:56 [gpu_model_runner.py:1898] Loading model from scratch...
INFO 07-30 06:06:56 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 07-30 06:06:56 [cuda.py:261] Using FlashInfer backend on V1 engine.
INFO 07-30 06:06:58 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
INFO 07-30 06:06:58 [compressed_tensors_moe.py:82] Using CompressedTensorsWNA16MarlinMoEMethod
Loading safetensors checkpoint shards: 0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 8% Completed | 1/13 [00:29<05:56, 29.71s/it]
Loading safetensors checkpoint shards: 15% Completed | 2/13 [00:53<04:49, 26.32s/it]
Loading safetensors checkpoint shards: 23% Completed | 3/13 [01:25<04:47, 28.78s/it]
Loading safetensors checkpoint shards: 31% Completed | 4/13 [02:02<04:48, 32.01s/it]
Loading safetensors checkpoint shards: 38% Completed | 5/13 [02:44<04:44, 35.55s/it]
Loading safetensors checkpoint shards: 46% Completed | 6/13 [03:17<04:03, 34.77s/it]
Loading safetensors checkpoint shards: 54% Completed | 7/13 [03:46<03:18, 33.03s/it]
Loading safetensors checkpoint shards: 62% Completed | 8/13 [04:16<02:39, 31.91s/it]
Loading safetensors checkpoint shards: 69% Completed | 9/13 [04:46<02:05, 31.33s/it]
Loading safetensors checkpoint shards: 77% Completed | 10/13 [04:51<01:10, 23.34s/it]
Loading safetensors checkpoint shards: 85% Completed | 11/13 [05:21<00:50, 25.22s/it]
Loading safetensors checkpoint shards: 92% Completed | 12/13 [05:50<00:26, 26.44s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [06:20<00:00, 27.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [06:20<00:00, 29.23s/it]

INFO 07-30 06:13:19 [default_loader.py:262] Loading weights took 380.38 seconds
ERROR 07-30 06:13:20 [core.py:647] EngineCore failed to start.
ERROR 07-30 06:13:20 [core.py:647] Traceback (most recent call last):
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/v1/engine/core.py", line 638, in run_engine_core
ERROR 07-30 06:13:20 [core.py:647] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-30 06:13:20 [core.py:647] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/v1/engine/core.py", line 456, in __init__
ERROR 07-30 06:13:20 [core.py:647] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/v1/engine/core.py", line 79, in __init__
ERROR 07-30 06:13:20 [core.py:647] self.model_executor = executor_class(vllm_config)
ERROR 07-30 06:13:20 [core.py:647] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-30 06:13:20 [core.py:647] self._init_executor()
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/executor/uniproc_executor.py", line 49, in _init_executor
ERROR 07-30 06:13:20 [core.py:647] self.collective_rpc("load_model")
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 07-30 06:13:20 [core.py:647] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-30 06:13:20 [core.py:647] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/utils/__init__.py", line 2987, in run_method
ERROR 07-30 06:13:20 [core.py:647] return func(*args, **kwargs)
ERROR 07-30 06:13:20 [core.py:647] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/v1/worker/gpu_worker.py", line 201, in load_model
ERROR 07-30 06:13:20 [core.py:647] self.model_runner.load_model(eep_scale_up=eep_scale_up)
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/v1/worker/gpu_model_runner.py", line 1899, in load_model
ERROR 07-30 06:13:20 [core.py:647] self.model = model_loader.load_model(
ERROR 07-30 06:13:20 [core.py:647] ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
ERROR 07-30 06:13:20 [core.py:647] process_weights_after_loading(model, model_config, target_device)
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
ERROR 07-30 06:13:20 [core.py:647] quant_method.process_weights_after_loading(module)
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 646, in process_weights_after_loading
ERROR 07-30 06:13:20 [core.py:647] layer.scheme.process_weights_after_loading(layer)
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py", line 197, in process_weights_after_loading
ERROR 07-30 06:13:20 [core.py:647] self.kernel.process_weights_after_loading(layer)
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm2/vllm/model_executor/layers/quantization/kernels/mixed_precision/bitblas.py", line 173, in process_weights_after_loading
ERROR 07-30 06:13:20 [core.py:647] layer.qweight,
ERROR 07-30 06:13:20 [core.py:647] ^^^^^^^^^^^^^
ERROR 07-30 06:13:20 [core.py:647] File "/thearray/git/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
ERROR 07-30 06:13:20 [core.py:647] raise AttributeError(
ERROR 07-30 06:13:20 [core.py:647] AttributeError: 'RowParallelLinear' object has no attribute 'qweight'
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/thearray/git/vllm2/vllm/v1/engine/core.py", line 651, in run_engine_core
raise e
File "/thearray/git/vllm2/vllm/v1/engine/core.py", line 638, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/v1/engine/core.py", line 456, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/thearray/git/vllm2/vllm/v1/engine/core.py", line 79, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/thearray/git/vllm2/vllm/executor/uniproc_executor.py", line 49, in _init_executor
self.collective_rpc("load_model")
File "/thearray/git/vllm2/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/utils/__init__.py", line 2987, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/v1/worker/gpu_worker.py", line 201, in load_model
self.model_runner.load_model(eep_scale_up=eep_scale_up)
File "/thearray/git/vllm2/vllm/v1/worker/gpu_model_runner.py", line 1899, in load_model
self.model = model_loader.load_model(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
process_weights_after_loading(model, model_config, target_device)
File "/thearray/git/vllm2/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
quant_method.process_weights_after_loading(module)
File "/thearray/git/vllm2/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 646, in process_weights_after_loading
layer.scheme.process_weights_after_loading(layer)
File "/thearray/git/vllm2/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py", line 197, in process_weights_after_loading
self.kernel.process_weights_after_loading(layer)
File "/thearray/git/vllm2/vllm/model_executor/layers/quantization/kernels/mixed_precision/bitblas.py", line 173, in process_weights_after_loading
layer.qweight,
^^^^^^^^^^^^^
File "/thearray/git/vllm/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
AttributeError: 'RowParallelLinear' object has no attribute 'qweight'
[rank0]:[W730 06:13:20.938735941 ProcessGroupNCCL.cpp:1521] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/thearray/git/vllm/venv/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/cli/main.py", line 54, in main
args.dispatch_function(args)
File "/thearray/git/vllm2/vllm/entrypoints/cli/serve.py", line 52, in cmd
uvloop.run(run_server(args))
File "/thearray/git/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/thearray/git/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 1809, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 1829, in run_server_worker
async with build_async_engine_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 165, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/entrypoints/openai/api_server.py", line 205, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/v1/engine/async_llm.py", line 164, in from_vllm_config
return cls(
^^^^
File "/thearray/git/vllm2/vllm/v1/engine/async_llm.py", line 118, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/v1/engine/core_client.py", line 100, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/thearray/git/vllm2/vllm/v1/engine/core_client.py", line 729, in __init__
super().__init__(
File "/thearray/git/vllm2/vllm/v1/engine/core_client.py", line 419, in __init__
with launch_core_engines(vllm_config, executor_class,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/thearray/git/vllm2/vllm/v1/engine/utils.py", line 697, in launch_core_engines
wait_for_engine_startup(
File "/thearray/git/vllm2/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

@sirus same error 😒

Could you please try FP4 quantization? Maybe that works? @cpatonn

Similar error for me as well. Perhaps the latest vLLM Docker image does not yet support this architecture.

INFO 07-30 05:08:22 [__init__.py:235] Automatically detected platform cuda.
vllm-1 | INFO 07-30 05:08:25 [api_server.py:1755] vLLM API server version 0.10.0
vllm-1 | INFO 07-30 05:08:25 [cli_args.py:261] non-default args: {'model': 'cpatonn/GLM-4.5-Air-AWQ', 'trust_remote_code': True, 'dtype': 'float16', 'enforce_eager': True, 'served_model_name': ['default'], 'tensor_parallel_size': 2, 'disable_custom_all_reduce': True}
vllm-1 | Traceback (most recent call last):
vllm-1 | File "<frozen runpy>", line 198, in _run_module_as_main
vllm-1 | File "<frozen runpy>", line 88, in _run_code
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1856, in <module>
vllm-1 | uvloop.run(run_server(args))
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
vllm-1 | return __asyncio.run(
vllm-1 | ^^^^^^^^^^^^^^
vllm-1 | File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
vllm-1 | return runner.run(main)
vllm-1 | ^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
vllm-1 | return self._loop.run_until_complete(task)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
vllm-1 | return await main
vllm-1 | ^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1791, in run_server
vllm-1 | await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1811, in run_server_worker
vllm-1 | async with build_async_engine_client(args, client_config) as engine_client:
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-1 | return await anext(self.gen)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
vllm-1 | async with build_async_engine_client_from_engine_args(
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
vllm-1 | return await anext(self.gen)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client_from_engine_args
vllm-1 | vllm_config = engine_args.create_engine_config(usage_context=usage_context)
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1004, in create_engine_config
vllm-1 | model_config = self.create_model_config()
vllm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 872, in create_model_config
vllm-1 | return ModelConfig(
vllm-1 | ^^^^^^^^^^^^
vllm-1 | File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
vllm-1 | s.pydantic_validator.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
vllm-1 | pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
vllm-1 | Value error, The checkpoint you are trying to load has model type glm4_moe but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
vllm-1 |
vllm-1 | You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git [type=value_error, input_value=ArgsKwargs((), {'model': ...attention_dtype': None}), input_type=ArgsKwargs]
vllm-1 | For further information visit https://errors.pydantic.dev/2.11/v/value_error

@siru . Sorry, I can't reproduce the error in your case. I rented an RTX Pro 6000 Blackwell and tested with driver version 570.169, CUDA 12.8, vLLM version 0.10.1.dev202+g5bbaf492a and the latest model update (commit 70707265d56916317d05d875e1df61151aece784), but I still could not reproduce it. The command that I used is

vllm serve cpatonn/GLM-4.5-Air-AWQ --dtype float16 

What was your command? My only suggestion is to use the vLLM pre-release version:

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

Edit: Hope this pastebin helps

4 x 3090s here. Same error.

(vllmtest) root@rtx3090-MS-7D90:~# vllm serve /root/models/GLM-4.5-Air-AWQ/ --trust-remote-code --disable-custom-all-reduce --max-model-len 32768 --port 14000 --served-model-name GLM-4.5-Air -tp 2 -pp 2 --dtype float16
INFO 07-30 20:21:01 [__init__.py:235] Automatically detected platform cuda.
(APIServer pid=446845) INFO 07-30 20:21:03 [api_server.py:1774] vLLM API server version 0.10.1.dev165+gab714131e
(APIServer pid=446845) INFO 07-30 20:21:03 [utils.py:326] non-default args: {'model_tag': '/root/models/GLM-4.5-Air-AWQ/', 'port': 14000, 'model': '/root/models/GLM-4.5-Air-AWQ/', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 32768, 'served_model_name': ['GLM-4.5-Air'], 'pipeline_parallel_size': 2, 'tensor_parallel_size': 2, 'disable_custom_all_reduce': True}
(APIServer pid=446845) INFO 07-30 20:21:08 [config.py:713] Resolved architecture: Glm4MoeForCausalLM
(APIServer pid=446845) WARNING 07-30 20:21:08 [config.py:3544] Casting torch.bfloat16 to torch.float16.
(APIServer pid=446845) INFO 07-30 20:21:08 [config.py:1724] Using max model len 32768
(APIServer pid=446845) INFO 07-30 20:21:08 [config.py:2535] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 07-30 20:21:12 [__init__.py:235] Automatically detected platform cuda.
INFO 07-30 20:21:13 [core.py:586] Waiting for init message from front-end.
INFO 07-30 20:21:13 [core.py:73] Initializing a V1 LLM engine (v0.10.1.dev165+gab714131e) with config: model='/root/models/GLM-4.5-Air-AWQ/', speculative_config=None, tokenizer='/root/models/GLM-4.5-Air-AWQ/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=GLM-4.5-Air, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-30 20:21:13 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-30 20:21:13 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_6e14b640'), local_subscribe_addr='ipc:///tmp/7aa97ffc-8553-4ecb-8c6e-f01b0572c4e3', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-30 20:21:17 [__init__.py:235] Automatically detected platform cuda.
INFO 07-30 20:21:17 [__init__.py:235] Automatically detected platform cuda.
INFO 07-30 20:21:17 [__init__.py:235] Automatically detected platform cuda.
INFO 07-30 20:21:17 [__init__.py:235] Automatically detected platform cuda.
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_7bba2b90'), local_subscribe_addr='ipc:///tmp/e65a832e-9dae-4e5f-be83-1f98dea362bb', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_912d3632'), local_subscribe_addr='ipc:///tmp/953cdfff-803a-4f25-b1ac-8e34d3c8f6ce', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_eb3ff431'), local_subscribe_addr='ipc:///tmp/6355f803-b70f-43b8-b5ae-3185cedcae67', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:20 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_c3870a54'), local_subscribe_addr='ipc:///tmp/723d9add-913e-409d-88f0-d11fe55de3af', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:20 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:20 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:20 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:20 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:20 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:20 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:20 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:20 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_cec040aa'), local_subscribe_addr='ipc:///tmp/fbdd616c-99f4-44c7-a555-912effbc2a9b', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_6b5bf24c'), local_subscribe_addr='ipc:///tmp/c1d32e0e-d971-45d5-ad34-0432e6e79d14', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:21 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:21 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [__init__.py:1376] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [parallel_state.py:1102] rank 2 in world size 4 is assigned as DP rank 0, PP rank 1, TP rank 0, EP rank 0
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [parallel_state.py:1102] rank 3 in world size 4 is assigned as DP rank 0, PP rank 1, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [parallel_state.py:1102] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:21 [parallel_state.py:1102] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=447341) WARNING 07-30 20:21:21 [topk_topp_sampler.py:60] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=2 pid=447343) WARNING 07-30 20:21:21 [topk_topp_sampler.py:60] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=447344) WARNING 07-30 20:21:21 [topk_topp_sampler.py:60] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=447342) WARNING 07-30 20:21:21 [topk_topp_sampler.py:60] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:21 [gpu_model_runner.py:1866] Starting to load model /root/models/GLM-4.5-Air-AWQ/...
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [gpu_model_runner.py:1866] Starting to load model /root/models/GLM-4.5-Air-AWQ/...
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [gpu_model_runner.py:1866] Starting to load model /root/models/GLM-4.5-Air-AWQ/...
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [gpu_model_runner.py:1866] Starting to load model /root/models/GLM-4.5-Air-AWQ/...
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [gpu_model_runner.py:1898] Loading model from scratch...
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [gpu_model_runner.py:1898] Loading model from scratch...
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:21 [gpu_model_runner.py:1898] Loading model from scratch...
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:21 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [gpu_model_runner.py:1898] Loading model from scratch...
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:21 [cuda.py:305] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [cuda.py:305] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:21 [cuda.py:305] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [cuda.py:305] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:21 [compressed_tensors_moe.py:81] Using CompressedTensorsWNA16MarlinMoEMethod
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:21 [compressed_tensors_moe.py:81] Using CompressedTensorsWNA16MarlinMoEMethod
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:22 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:22 [compressed_tensors_moe.py:81] Using CompressedTensorsWNA16MarlinMoEMethod
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:22 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:22 [compressed_tensors_moe.py:81] Using CompressedTensorsWNA16MarlinMoEMethod
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:22 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:22 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
Loading safetensors checkpoint shards: 0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 8% Completed | 1/13 [00:00<00:01, 6.63it/s]
Loading safetensors checkpoint shards: 15% Completed | 2/13 [00:00<00:01, 6.61it/s]
Loading safetensors checkpoint shards: 23% Completed | 3/13 [00:00<00:01, 6.59it/s]
Loading safetensors checkpoint shards: 31% Completed | 4/13 [00:00<00:01, 6.48it/s]
Loading safetensors checkpoint shards: 38% Completed | 5/13 [00:02<00:06, 1.20it/s]
Loading safetensors checkpoint shards: 46% Completed | 6/13 [00:04<00:08, 1.27s/it]
Loading safetensors checkpoint shards: 54% Completed | 7/13 [00:04<00:05, 1.09it/s]
Loading safetensors checkpoint shards: 62% Completed | 8/13 [00:06<00:06, 1.25s/it]
Loading safetensors checkpoint shards: 69% Completed | 9/13 [00:07<00:03, 1.00it/s]
Loading safetensors checkpoint shards: 77% Completed | 10/13 [00:09<00:03, 1.30s/it]
Loading safetensors checkpoint shards: 85% Completed | 11/13 [00:11<00:02, 1.46s/it]
Loading safetensors checkpoint shards: 92% Completed | 12/13 [00:13<00:01, 1.64s/it]
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:35 [default_loader.py:262] Loading weights took 13.46 seconds
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:13<00:00, 1.24s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:13<00:00, 1.04s/it]
(VllmWorker rank=0 pid=447341)
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:35 [default_loader.py:262] Loading weights took 13.53 seconds
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] WorkerProc failed to start.
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] Traceback (most recent call last):
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 531, in worker_main
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 400, in __init__
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] self.worker.load_model()
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 201, in load_model
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1899, in load_model
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] self.model = model_loader.load_model(
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] process_weights_after_loading(model, model_config, target_device)
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 112, in process_weights_after_loading
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] quant_method.process_weights_after_loading(module)
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 646, in process_weights_after_loading
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] layer.scheme.process_weights_after_loading(layer)
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py", line 197, in process_weights_after_loading
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] self.kernel.process_weights_after_loading(layer)
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/bitblas.py", line 173, in process_weights_after_loading
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] layer.qweight,
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] ^^^^^^^^^^^^^
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1940, in __getattr__
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] raise AttributeError(
(VllmWorker rank=1 pid=447342) ERROR 07-30 20:21:36 [multiproc_executor.py:557] AttributeError: 'RowParallelLinear' object has no attribute 'qweight'
(VllmWorker rank=3 pid=447344) INFO 07-30 20:21:36 [multiproc_executor.py:518] Parent process exited, terminating worker
(VllmWorker rank=2 pid=447343) INFO 07-30 20:21:36 [multiproc_executor.py:518] Parent process exited, terminating worker
(VllmWorker rank=1 pid=447342) INFO 07-30 20:21:36 [multiproc_executor.py:518] Parent process exited, terminating worker
(VllmWorker rank=0 pid=447341) INFO 07-30 20:21:36 [multiproc_executor.py:518] Parent process exited, terminating worker
[rank0]:[W730 20:21:36.175926761 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 07-30 20:21:38 [core.py:648] EngineCore failed to start.
ERROR 07-30 20:21:38 [core.py:648] Traceback (most recent call last):
ERROR 07-30 20:21:38 [core.py:648] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 639, in run_engine_core
ERROR 07-30 20:21:38 [core.py:648] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-30 20:21:38 [core.py:648] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 20:21:38 [core.py:648] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 455, in __init__
ERROR 07-30 20:21:38 [core.py:648] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-30 20:21:38 [core.py:648] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 79, in __init__
ERROR 07-30 20:21:38 [core.py:648] self.model_executor = executor_class(vllm_config)
ERROR 07-30 20:21:38 [core.py:648] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 20:21:38 [core.py:648] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-30 20:21:38 [core.py:648] self._init_executor()
ERROR 07-30 20:21:38 [core.py:648] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _init_executor
ERROR 07-30 20:21:38 [core.py:648] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-30 20:21:38 [core.py:648] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 20:21:38 [core.py:648] File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 470, in wait_for_ready
ERROR 07-30 20:21:38 [core.py:648] raise e from None
ERROR 07-30 20:21:38 [core.py:648] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 652, in run_engine_core
raise e
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 639, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 455, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 79, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 94, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 470, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=446845) Traceback (most recent call last):
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/bin/vllm", line 8, in <module>
(APIServer pid=446845) sys.exit(main())
(APIServer pid=446845) ^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=446845) args.dispatch_function(args)
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 52, in cmd
(APIServer pid=446845) uvloop.run(run_server(args))
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=446845) return __asyncio.run(
(APIServer pid=446845) ^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=446845) return runner.run(main)
(APIServer pid=446845) ^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=446845) return self._loop.run_until_complete(task)
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=446845) return await main
(APIServer pid=446845) ^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1817, in run_server
(APIServer pid=446845) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1837, in run_server_worker
(APIServer pid=446845) async with build_async_engine_client(
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=446845) return await anext(self.gen)
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client
(APIServer pid=446845) async with build_async_engine_client_from_engine_args(
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=446845) return await anext(self.gen)
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 206, in build_async_engine_client_from_engine_args
(APIServer pid=446845) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 164, in from_vllm_config
(APIServer pid=446845) return cls(
(APIServer pid=446845) ^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
(APIServer pid=446845) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 100, in make_async_mp_client
(APIServer pid=446845) return AsyncMPClient(*client_args)
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 729, in __init__
(APIServer pid=446845) super().__init__(
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 419, in __init__
(APIServer pid=446845) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=446845) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=446845) next(self.gen)
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=446845) wait_for_engine_startup(
(APIServer pid=446845) File "/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=446845) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=446845) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/home/rtx3090/miniconda3/envs/vllmtest/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

@nagug Thanks for the info. GLM-4.5-Air uses a new architecture and configs, which zai-org only merged into the transformers repo a short while ago. I would suggest building the vLLM Docker image with the latest vLLM and transformers.
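
For reference, a minimal sketch of running the prebuilt OpenAI-compatible image; this assumes the stock vllm/vllm-openai image and will only work once that image ships a vLLM/transformers combination that recognizes the glm4_moe architecture (otherwise a locally built image with newer transformers is needed):

    docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        vllm/vllm-openai:latest \
        --model cpatonn/GLM-4.5-Air-AWQ --dtype float16 --trust-remote-code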

@nikichen777 I think it was vLLM. I tested on my 4x 3090s with your vLLM version and it did not work, but once I upgraded to the pre-release version it ran well. Please update vLLM using the following:

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

It works after updating vLLM. Thanks for your help! @cpatonn

@siru . Sorry, I can't reproduce the error in your case. I rented an RTX Pro 6000 Blackwell and tested with driver version 570.169, CUDA 12.8, vLLM version 0.10.1.dev202+g5bbaf492a and the latest model update (commit 70707265d56916317d05d875e1df61151aece784), but I still could not reproduce it. The command that I used is

vllm serve cpatonn/GLM-4.5-Air-AWQ --dtype float16 

What was your command? My only suggestion is to use the vLLM pre-release version:

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

Edit: Hope this pastebin helps

that's even worse than my compiled version

((venv) ) ➜ vllm2 git:(main) ✗ ./glm
INFO 07-30 21:24:30 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=921734) INFO 07-30 21:24:36 [api_server.py:1774] vLLM API server version 0.10.1.dev231+g9cb497bfa
(APIServer pid=921734) INFO 07-30 21:24:36 [utils.py:326] non-default args: {'model_tag': '/thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ', 'port': 9900, 'api_key': ['sk-58f67ef76e8845bc0b2ce17f578a22a20465463c4923f2e4b800703ec8caac25'], 'model': '/thearray/git/ob/text-generation-webui/models/GLM-4.5-Air-AWQ', 'dtype': 'float16', 'max_model_len': 8000}
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] Error in inspecting model architecture 'Glm4MoeForCausalLM'
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] Traceback (most recent call last):
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 820, in _run_in_subprocess
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] returned.check_returncode()
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/usr/lib/python3.12/subprocess.py", line 502, in check_returncode
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] raise CalledProcessError(self.returncode, self.args, self.stdout,
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] subprocess.CalledProcessError: Command '['/thearray/git/vllm/venv/bin/python3.12', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410]
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] The above exception was the direct cause of the following exception:
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410]
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] Traceback (most recent call last):
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 408, in _try_inspect_model_cls
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] return model.inspect_model_cls()
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] ^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 379, in inspect_model_cls
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] return _run_in_subprocess(
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] ^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/model_executor/models/registry.py", line 823, in _run_in_subprocess
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] raise RuntimeError(f"Error raised in subprocess:\n"
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] RuntimeError: Error raised in subprocess:
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] Traceback (most recent call last):
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "<frozen runpy>", line 189, in _run_module_as_main
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "<frozen runpy>", line 112, in _get_module_details
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/model_executor/__init__.py", line 4, in <module>
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] from vllm.model_executor.parameter import (BasevLLMParameter,
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/model_executor/parameter.py", line 10, in <module>
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] from vllm.distributed import get_tensor_model_parallel_rank
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/distributed/__init__.py", line 4, in <module>
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] from .communication_op import *
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/distributed/communication_op.py", line 9, in <module>
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] from .parallel_state import get_tp_group
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/distributed/parallel_state.py", line 150, in <module>
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] from vllm.platforms import current_platform
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/platforms/__init__.py", line 267, in __getattr__
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] _current_platform = resolve_obj_by_qualname(
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] ^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/utils/__init__.py", line 2540, in resolve_obj_by_qualname
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] module = importlib.import_module(module_name)
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/usr/lib/python3.12/importlib/__init__.py", line 90, in import_module
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] return _bootstrap._gcd_import(name[level:], package, level)
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] File "/thearray/git/vllm2/vllm/platforms/cuda.py", line 18, in <module>
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] import vllm._C # noqa
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] ^^^^^^^^^^^^^^
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410] ImportError: /thearray/git/vllm2/vllm/_C.abi3.so: undefined symbol: _ZN3c104cuda9SetDeviceEab
(APIServer pid=921734) ERROR 07-30 21:24:46 [registry.py:410]
(APIServer pid=921734) Traceback (most recent call last):
(APIServer pid=921734) File "/thearray/git/vllm/venv/bin/vllm", line 8, in <module>
(APIServer pid=921734) sys.exit(main())
(APIServer pid=921734) ^^^^^^
(APIServer pid=921734) File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=921734) args.dispatch_function(args)
(APIServer pid=921734) File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 52, in cmd
(APIServer pid=921734) uvloop.run(run_server(args))
(APIServer pid=921734) File "/thearray/git/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=921734) return __asyncio.run(
(APIServer pid=921734) ^^^^^^^^^^^^^^
(APIServer pid=921734) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=921734) return runner.run(main)
(APIServer pid=921734) ^^^^^^^^^^^^^^^^
(APIServer pid=921734) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=921734) return self._loop.run_until_complete(task)
(APIServer pid=921734) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=921734) File "/thearray/git/vllm/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=921734) return await main
(APIServer pid=921734) ^^^^^^^^^^
(APIServer pid=921734) File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1817, in run_server
(APIServer pid=921734) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=921734) File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1837, in run_server_worker
(APIServer pid=921734) async with build_async_engine_client(
(APIServer pid=921734) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=921734) return await anext(self.gen)
(APIServer pid=921734) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=921734) File "/thearray/git/vllm/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client
(APIServer pid=921734) async with build_async_engine_client_from_engine_args(
(APIServer pid=921734) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@sirus . Okay, based on the stack trace, I think the error you are now getting is:

ImportError: /thearray/git/vllm2/vllm/_C.abi3.so: undefined symbol: _ZN3c104cuda9SetDeviceEab

Could you try updating the nvidia driver to the latest version, CUDA 12.8, pytorch 2.7.1 built for CUDA 12.8, and the latest pre-release vllm? That error comes from a CUDA version that does not match the one pytorch was built for. I got that error a few times, and updating CUDA to 12.8 helped. I am not sure whether the nvidia driver played a role in causing this error, but just to be sure, I would recommend updating it to the latest version as well.

In short: nvidia driver, CUDA 12.8, and pytorch. Also please remember to install the latest pre-release vllm version!
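
Roughly, the version checks and installs could look like this (a sketch only; the torch pin follows the versions suggested above, and the index URLs are the standard PyTorch cu128 and vllm nightly indexes):

nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda)"
pip install -U torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly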

I'm confident that this is the last error you'll get.

Edit: fixed some formatting.
