vLLM not starting (vllm-docker)
INFO 04-10 07:13:17 [__init__.py:239] Automatically detected platform cuda.
INFO 04-10 07:13:20 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-10 07:13:20 [api_server.py:1035] args: Namespace(host='0.0.0.0', port=8010, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.15, num_gpu_blocks_override=None, max_num_batched_tokens=4096, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-10 07:13:28 [config.py:600] This model supports multiple tasks: {'score', 'classify', 'generate', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 04-10 07:13:30 [config.py:679] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-10 07:13:30 [config.py:1600] Defaulting to use mp for distributed inference
INFO 04-10 07:13:30 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=4096.
INFO 04-10 07:13:35 [__init__.py:239] Automatically detected platform cuda.
INFO 04-10 07:13:37 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-10 07:13:37 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-10 07:13:37 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_aaa5f28e'), local_subscribe_addr='ipc:///tmp/ee84773d-15e2-4160-b693-f2c31238e650', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-10 07:13:41 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-10 07:13:43 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-10 07:13:43 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-2efae/VLLM_TRACE_FUNCTION_for_process_125_thread_139648289789056_at_2025-04-10_07:13:43.807798.log
WARNING 04-10 07:13:50 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f017f82ef90>
(VllmWorker rank=0 pid=125) INFO 04-10 07:13:50 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_75fb11a3'), local_subscribe_addr='ipc:///tmp/4f352f02-b9fc-470e-b6c8-9677bddf02d5', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-10 07:13:53 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-10 07:13:55 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-10 07:13:55 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-2efae/VLLM_TRACE_FUNCTION_for_process_142_thread_140026205201536_at_2025-04-10_07:13:55.925549.log
WARNING 04-10 07:14:01 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f597cd86c60>
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:01 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fb3bd266'), local_subscribe_addr='ipc:///tmp/1e3fa98f-6f43-4620-bbd3-558514d90e84', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:02 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:02 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:02 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:02 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:02 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:17 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=142) WARNING 04-10 07:14:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=125) WARNING 04-10 07:14:17 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:17 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_a3f5bd13'), local_subscribe_addr='ipc:///tmp/951682b1-f13b-4c83-84cd-5dbe10d5f943', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:17 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:17 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:17 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:17 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:24 [gpu_model_runner.py:1258] Starting to load model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit...
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:25 [gpu_model_runner.py:1258] Starting to load model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit...
(VllmWorker rank=1 pid=142) INFO 04-10 07:14:26 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=125) INFO 04-10 07:14:27 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=142) WARNING 04-10 07:14:27 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=142) WARNING 04-10 07:14:27 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=125) WARNING 04-10 07:14:27 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=125) WARNING 04-10 07:14:27 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
CRITICAL 04-10 07:14:27 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(VllmWorker rank=1 pid=142) Process SpawnProcess-1:2:
(VllmWorker rank=1 pid=142) Traceback (most recent call last):
CRITICAL 04-10 07:14:27 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 69, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 570, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 401, in __init__
engine.proc_handle.wait_for_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 127, in wait_for_startup
if self.reader.recv()["status"] != "READY":
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 399, in _recv
raise EOFError
EOFError
INFO 04-10 07:15:12 [__init__.py:239] Automatically detected platform cuda.
INFO 04-10 07:15:14 [api_server.py:1034] vLLM API server version 0.8.3
INFO 04-10 07:15:14 [api_server.py:1035] args: Namespace(host='0.0.0.0', port=8010, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=[''], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.15, num_gpu_blocks_override=None, max_num_batched_tokens=4096, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-10 07:15:22 [config.py:600] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 04-10 07:15:23 [config.py:679] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-10 07:15:23 [config.py:1600] Defaulting to use mp for distributed inference
INFO 04-10 07:15:23 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=4096.
INFO 04-10 07:15:28 [__init__.py:239] Automatically detected platform cuda.
INFO 04-10 07:15:31 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-10 07:15:31 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-10 07:15:31 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_236efceb'), local_subscribe_addr='ipc:///tmp/9fafc1b0-68e6-4771-b860-73cc46819bbb', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-10 07:15:34 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-10 07:15:36 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-10 07:15:36 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-7b406/VLLM_TRACE_FUNCTION_for_process_125_thread_139877240632448_at_2025-04-10_07:15:36.762433.log
WARNING 04-10 07:15:42 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f36a03ac170>
(VllmWorker rank=0 pid=125) INFO 04-10 07:15:42 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_6d1b305b'), local_subscribe_addr='ipc:///tmp/0963a6a1-2143-4976-8b11-0f81f2f1b6a5', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-10 07:15:45 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-10 07:15:48 [logger.py:202] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 04-10 07:15:48 [logger.py:206] Trace frame log is saved to /tmp/root/vllm/vllm-instance-7b406/VLLM_TRACE_FUNCTION_for_process_142_thread_140461142779008_at_2025-04-10_07:15:48.085142.log
WARNING 04-10 07:15:53 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fbebf43bb60>
(VllmWorker rank=1 pid=142) INFO 04-10 07:15:53 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_727d23ca'), local_subscribe_addr='ipc:///tmp/3d693d02-99a8-49c8-8141-ebaa1601146c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=142) INFO 04-10 07:15:54 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=125) INFO 04-10 07:15:54 [utils.py:990] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=142) INFO 04-10 07:15:54 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=125) INFO 04-10 07:15:54 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=125) INFO 04-10 07:15:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=142) INFO 04-10 07:15:54 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=125) WARNING 04-10 07:15:54 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=1 pid=142) WARNING 04-10 07:15:54 [custom_all_reduce.py:146] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=125) INFO 04-10 07:15:54 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_a35d9b38'), local_subscribe_addr='ipc:///tmp/b4f4e50c-9839-4258-9e48-f8425ea8c64c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=125) INFO 04-10 07:15:54 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=142) INFO 04-10 07:15:54 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=125) INFO 04-10 07:15:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=142) INFO 04-10 07:15:54 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=142) INFO 04-10 07:16:01 [gpu_model_runner.py:1258] Starting to load model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit...
(VllmWorker rank=0 pid=125) INFO 04-10 07:16:01 [gpu_model_runner.py:1258] Starting to load model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit...
(VllmWorker rank=1 pid=142) INFO 04-10 07:16:03 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=125) INFO 04-10 07:16:03 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=125) WARNING 04-10 07:16:04 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=125) WARNING 04-10 07:16:04 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=142) WARNING 04-10 07:16:04 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=142) WARNING 04-10 07:16:04 [config.py:3785] torch.compile
is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=125) Process SpawnProcess-1:1:
CRITICAL 04-10 07:16:04 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(VllmWorker rank=0 pid=125) Traceback (most recent call last):
CRITICAL 04-10 07:16:04 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
(VllmWorker rank=0 pid=125) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorker rank=0 pid=125) self.run()
(VllmWorker rank=0 pid=125) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(VllmWorker rank=0 pid=125) self._target(*self._args, **self._kwargs)
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 317, in worker_main
(VllmWorker rank=0 pid=125) worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=0 pid=125) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 248, in __init__
(VllmWorker rank=0 pid=125) self.worker.load_model()
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 136, in load_model
(VllmWorker rank=0 pid=125) self.model_runner.load_model()
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1261, in load_model
(VllmWorker rank=0 pid=125) self.model = get_model(vllm_config=self.vllm_config)
(VllmWorker rank=0 pid=125) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
(VllmWorker rank=0 pid=125) return loader.load_model(vllm_config=vllm_config)
(VllmWorker rank=0 pid=125) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 441, in load_model
(VllmWorker rank=0 pid=125) model = _initialize_model(vllm_config=vllm_config)
(VllmWorker rank=0 pid=125) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
(VllmWorker rank=0 pid=125) return model_class(vllm_config=vllm_config, prefix=prefix)
(VllmWorker rank=0 pid=125) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=125) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama4.py", line 713, in __init__
(VllmWorker rank=0 pid=125) self.language_model = init_vllm_registered_model(
(VllmWorker rank=0 pid=125) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
self.engine_core = EngineCoreClient.make_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 69, in make_client
return AsyncMPClient(vllm_config, executor_class, log_stats)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 570, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 401, in __init__
engine.proc_handle.wait_for_startup()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 127, in wait_for_startup
if self.reader.recv()["status"] != "READY":
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 399, in _recv
raise EOFError
EOFError
Docker Compose service definition:

vllm-llama4-scout-4bnb:
  image: vllm/vllm-openai:latest
  container_name: vllm-llama4-scout-4bnb
  environment:
    - NVIDIA_VISIBLE_DEVICES=0,1
    - HUGGING_FACE_HUB_TOKEN=****************
    - VLLM_TRACE_FUNCTION=1
    - VLLM_DISABLE_COMPILE_CACHE=1
  volumes:
    - llama4-scout:/root/.cache/huggingface
  ports:
    - "8015:8010"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 2
            capabilities: [gpu]
  command: >
    --model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit
    --tensor-parallel-size 2
    --host 0.0.0.0
    --port 8010
    --max-model-len 32768
    --gpu_memory_utilization 0.15
    --max-num-batched-tokens=4096
  ipc: host
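For completeness, this is the minimal smoke test I would run against the host-mapped port (8015 -> 8010) once the server is up. It is just a sketch: it assumes the standard OpenAI-compatible /v1/chat/completions route and the default served model name (the --model value), and it never gets a chance to run while startup keeps failing.

import requests

# Probe the OpenAI-compatible endpoint through the compose port mapping (host 8015 -> container 8010).
resp = requests.post(
    "http://localhost:8015/v1/chat/completions",
    json={
        "model": "unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=120,
)
print(resp.status_code, resp.json())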
Running on 2x RTX 6000 Ada (2x 48 GB VRAM).
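For context, a rough back-of-the-envelope on the memory budget implied by --gpu_memory_utilization 0.15 on these cards. This is pure arithmetic on the values above; the size of the bnb-4bit checkpoint itself is an assumption, not something measured here.

# Rough memory-budget arithmetic for this config (estimates, not measurements).
GPU_MEM_GB = 48        # per RTX 6000 Ada
GPU_MEM_UTIL = 0.15    # --gpu_memory_utilization from the compose command
TP_SIZE = 2            # --tensor-parallel-size

per_gpu_budget_gb = GPU_MEM_GB * GPU_MEM_UTIL   # ~7.2 GB that vLLM allows itself per GPU
total_budget_gb = per_gpu_budget_gb * TP_SIZE   # ~14.4 GB across both TP ranks
print(f"per-GPU budget ~{per_gpu_budget_gb:.1f} GB, total ~{total_budget_gb:.1f} GB")

# Assumption: the 4-bit Scout checkpoint is on the order of tens of GB, so this
# budget would be tight for weights plus KV cache even if startup succeeded.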