Can't run this in vLLM

#5
by joeofportland - opened

Hello,
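I'd love to try out this model, but I can't get it to run in vLLM. Any idea why? Maybe it's not supported yet? The engine fails during weight loading with `KeyError: 'model.layers.23.mlp.experts.w2_bias'`; the full log is below.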

(venv) root@fdc7a733a395:/workspace# VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 vllm serve huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated --port 80
INFO 08-08 00:52:02 [__init__.py:241] Automatically detected platform cuda.
(APIServer pid=20297) INFO 08-08 00:52:05 [api_server.py:1787] vLLM API server version 0.10.2.dev2+gf5635d62e.d20250807
(APIServer pid=20297) INFO 08-08 00:52:05 [utils.py:326] non-default args: {'model_tag': 'huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated', 'port': 80, 'model': 'huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated'}
config.json: 1.69kB [00:00, 5.76MB/s]
(APIServer pid=20297) INFO 08-08 00:52:11 [config.py:726] Resolved architecture: GptOssForCausalLM
(APIServer pid=20297) INFO 08-08 00:52:11 [config.py:1759] Using max model len 131072
(APIServer pid=20297) INFO 08-08 00:52:12 [config.py:2588] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=20297) INFO 08-08 00:52:12 [config.py:244] Overriding cuda graph sizes to [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024]
tokenizer_config.json: 4.38kB [00:00, 17.3MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27.9M/27.9M [00:00<00:00, 132MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 463/463 [00:00<00:00, 3.51MB/s]
chat_template.jinja: 15.4kB [00:00, 36.8MB/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 175/175 [00:00<00:00, 1.41MB/s]
INFO 08-08 00:52:18 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=20562) INFO 08-08 00:52:20 [core.py:654] Waiting for init message from front-end.
(EngineCore_0 pid=20562) INFO 08-08 00:52:20 [core.py:73] Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated', speculative_config=None, tokenizer='huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='openai'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, 
compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":1024,"local_cache_dir":null}
(EngineCore_0 pid=20562)
(EngineCore_0 pid=20562) LL LL MMM MMM
(EngineCore_0 pid=20562) LL LL MMMM MMMM
(EngineCore_0 pid=20562) V LL LL MM MM MM MM
(EngineCore_0 pid=20562) vvvv VVVV LL LL MM MM MM MM
(EngineCore_0 pid=20562) vvvv VVVV LL LL MM MMM MM
(EngineCore_0 pid=20562) vvv VVVV LL LL MM M MM
(EngineCore_0 pid=20562) vvVVVV LL LL MM MM
(EngineCore_0 pid=20562) VVVV LLLLLLLLLL LLLLLLLLL M M
(EngineCore_0 pid=20562)
(EngineCore_0 pid=20562) W0808 00:52:20.166000 20562 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_0 pid=20562) W0808 00:52:20.166000 20562 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W808 00:52:20.864917353 ProcessGroupNCCL.cpp:915] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=20562) INFO 08-08 00:52:20 [parallel_state.py:1102] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=20562) INFO 08-08 00:52:20 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(EngineCore_0 pid=20562) INFO 08-08 00:52:20 [gpu_model_runner.py:1913] Starting to load model huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated...
(EngineCore_0 pid=20562) INFO 08-08 00:52:21 [gpu_model_runner.py:1945] Loading model from scratch...
(EngineCore_0 pid=20562) INFO 08-08 00:52:21 [cuda.py:286] Using Triton backend on V1 engine.
(EngineCore_0 pid=20562) WARNING 08-08 00:52:21 [rocm.py:29] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
(EngineCore_0 pid=20562) WARNING 08-08 00:52:21 [rocm.py:40] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
(EngineCore_0 pid=20562) INFO 08-08 00:52:21 [triton_attn.py:263] Using vllm unified attention for TritonAttentionImpl
(EngineCore_0 pid=20562) INFO 08-08 00:52:21 [weight_utils.py:296] Using model weights format ['*.safetensors']
model-00001-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.50G/4.50G [00:48<00:00, 92.8MB/s]
model-00003-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:50<00:00, 97.7MB/s]
model-00005-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:50<00:00, 97.3MB/s]
model-00007-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:52<00:00, 93.8MB/s]
model-00002-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:52<00:00, 93.6MB/s]
model-00008-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:52<00:00, 93.4MB/s]
model-00006-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:54<00:00, 90.7MB/s]
model-00004-of-00009.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.94G/4.94G [00:56<00:00, 87.5MB/s]
model-00009-of-00009.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.75G/2.75G [00:15<00:00, 182MB/s]
(EngineCore_0 pid=20562) INFO 08-08 00:53:26 [weight_utils.py:312] Time spent downloading weights for huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated: 64.499592 seconds
model.safetensors.index.json: 34.1kB [00:00, 93.1MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/9 [00:00<?, ?it/s]
(EngineCore_0 pid=20562) Warning: model.layers.23.mlp.experts.down_proj not found in params_dict
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] EngineCore failed to start.
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] Traceback (most recent call last):
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] self._init_executor()
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] self.collective_rpc("load_model")
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] return func(*args, **kwargs)
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] self.model = model_loader.load_model(
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] self.load_weights(model, model_config)
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 259, in load_weights
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] loaded_weights = model.load_weights(
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] File "/root/venv/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 429, in load_weights
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] param = params_dict[new_name]
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] ~~~~~~~~~~~^^^^^^^^^^
(EngineCore_0 pid=20562) ERROR 08-08 00:53:30 [core.py:718] KeyError: 'model.layers.23.mlp.experts.w2_bias'
(EngineCore_0 pid=20562) Process EngineCore_0:
(EngineCore_0 pid=20562) Traceback (most recent call last):
(EngineCore_0 pid=20562) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=20562) self.run()
(EngineCore_0 pid=20562) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=20562) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_0 pid=20562) raise e
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_0 pid=20562) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_0 pid=20562) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 510, in __init__
(EngineCore_0 pid=20562) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_0 pid=20562) self.model_executor = executor_class(vllm_config)
(EngineCore_0 pid=20562) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_0 pid=20562) self._init_executor()
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_0 pid=20562) self.collective_rpc("load_model")
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=20562) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=20562) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 2948, in run_method
(EngineCore_0 pid=20562) return func(*args, **kwargs)
(EngineCore_0 pid=20562) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 211, in load_model
(EngineCore_0 pid=20562) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1946, in load_model
(EngineCore_0 pid=20562) self.model = model_loader.load_model(
(EngineCore_0 pid=20562) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(EngineCore_0 pid=20562) self.load_weights(model, model_config)
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 259, in load_weights
(EngineCore_0 pid=20562) loaded_weights = model.load_weights(
(EngineCore_0 pid=20562) ^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=20562) File "/root/venv/lib/python3.12/site-packages/vllm/model_executor/models/gpt_oss.py", line 429, in load_weights
(EngineCore_0 pid=20562) param = params_dict[new_name]
(EngineCore_0 pid=20562) ~~~~~~~~~~~^^^^^^^^^^
(EngineCore_0 pid=20562) KeyError: 'model.layers.23.mlp.experts.w2_bias'
Loading safetensors checkpoint shards: 0% Completed | 0/9 [00:04<?, ?it/s]
(EngineCore_0 pid=20562)
[rank0]:[W808 00:53:31.440099671 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=20297) Traceback (most recent call last):
(APIServer pid=20297) File "/root/venv/bin/vllm", line 10, in <module>
(APIServer pid=20297) sys.exit(main())
(APIServer pid=20297) ^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=20297) args.dispatch_function(args)
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=20297) uvloop.run(run_server(args))
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=20297) return __asyncio.run(
(APIServer pid=20297) ^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=20297) return runner.run(main)
(APIServer pid=20297) ^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=20297) return self._loop.run_until_complete(task)
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=20297) return await main
(APIServer pid=20297) ^^^^^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1827, in run_server
(APIServer pid=20297) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1847, in run_server_worker
(APIServer pid=20297) async with build_async_engine_client(
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=20297) return await anext(self.gen)
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 167, in build_async_engine_client
(APIServer pid=20297) async with build_async_engine_client_from_engine_args(
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=20297) return await anext(self.gen)
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 209, in build_async_engine_client_from_engine_args
(APIServer pid=20297) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1520, in inner
(APIServer pid=20297) return fn(*args, **kwargs)
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 173, in from_vllm_config
(APIServer pid=20297) return cls(
(APIServer pid=20297) ^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 119, in __init__
(APIServer pid=20297) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 101, in make_async_mp_client
(APIServer pid=20297) return AsyncMPClient(*client_args)
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 733, in __init__
(APIServer pid=20297) super().__init__(
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in __init__
(APIServer pid=20297) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=20297) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20297) File "/root/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=20297) next(self.gen)
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
(APIServer pid=20297) wait_for_engine_startup(
(APIServer pid=20297) File "/root/venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
(APIServer pid=20297) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=20297) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
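
In case it helps with debugging: the warning and the `KeyError` both point at layer 23's expert tensors, so my guess is that the tensor names in this checkpoint don't match what vLLM's `gpt_oss.py` loader expects. Here is a rough sketch for listing what the checkpoint actually contains for that layer. It only reads the `weight_map` table of `model.safetensors.index.json`; the local path is a placeholder, and the function names are my own, not part of vLLM or any library.

```python
import json
from pathlib import Path


def expert_tensor_names(weight_map, layer):
    """Return the sorted checkpoint tensor names under one layer's MoE experts."""
    # The trailing dot keeps layer 2 from also matching layer 23.
    prefix = f"model.layers.{layer}.mlp.experts."
    return sorted(name for name in weight_map if name.startswith(prefix))


def load_weight_map(index_path):
    """Read the "weight_map" table from a sharded-checkpoint index file."""
    with open(index_path, encoding="utf-8") as f:
        return json.load(f)["weight_map"]


if __name__ == "__main__":
    # Placeholder path: point this at the downloaded snapshot directory.
    index_path = Path("model.safetensors.index.json")
    if index_path.exists():
        # Layer 23 is the one named in the warning and the KeyError above.
        for name in expert_tensor_names(load_weight_map(index_path), 23):
            print(name)
```

Comparing that list against the same listing for a checkpoint that does load in vLLM should show whether the names or the set of tensors differ.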
