assert self.quant_method is not None

#5 · opened by Seri0usLee

I am running vLLM like so:

"python3 -m vllm.entrypoints.openai.api_server --model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit --served-model-name Llama-4-Scout --port 9000 --max-model-len 100000"

And also like so:

"python3 -m vllm.entrypoints.openai.api_server --model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit --served-model-name Llama-4-Scout --port 9000 --max-model-len 100000 --quantization bitsandbytes"

But both give me the following error:

INFO 04-06 00:57:19 [core.py:61] Initializing a V1 LLM engine (v0.8.3) with config: model='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', speculative_config=None, tokenizer='unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=100000, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Llama-4-Scout, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-06 00:57:28 [utils.py:2413] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in 
[rank0]:[W406 00:57:28.817395294 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-06 00:57:28 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-06 00:57:28 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 04-06 00:57:35 [gpu_model_runner.py:1258] Starting to load model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit...
INFO 04-06 00:57:35 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
WARNING 04-06 00:57:36 [config.py:3785] `torch.compile` is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
WARNING 04-06 00:57:36 [config.py:3785] `torch.compile` is turned on, but the model unsloth/Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit does not support it. Please open an issue on GitHub if you want it to be supported.
ERROR 04-06 00:57:37 [core.py:390] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-06 00:57:37 [core.py:390]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-06 00:57:37 [core.py:390]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 319, in __init__
ERROR 04-06 00:57:37 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 67, in __init__
ERROR 04-06 00:57:37 [core.py:390]     self.model_executor = executor_class(vllm_config)
ERROR 04-06 00:57:37 [core.py:390]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-06 00:57:37 [core.py:390]     self._init_executor()
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-06 00:57:37 [core.py:390]     self.collective_rpc("load_model")
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-06 00:57:37 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-06 00:57:37 [core.py:390]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2347, in run_method
ERROR 04-06 00:57:37 [core.py:390]     return func(*args, **kwargs)
ERROR 04-06 00:57:37 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 136, in load_model
ERROR 04-06 00:57:37 [core.py:390]     self.model_runner.load_model()
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1261, in load_model
ERROR 04-06 00:57:37 [core.py:390]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-06 00:57:37 [core.py:390]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-06 00:57:37 [core.py:390]     return loader.load_model(vllm_config=vllm_config)
ERROR 04-06 00:57:37 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1278, in load_model
ERROR 04-06 00:57:37 [core.py:390]     model = _initialize_model(vllm_config=vllm_config)
ERROR 04-06 00:57:37 [core.py:390]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
ERROR 04-06 00:57:37 [core.py:390]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-06 00:57:37 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mllama4.py", line 713, in __init__
ERROR 04-06 00:57:37 [core.py:390]     self.language_model = init_vllm_registered_model(
ERROR 04-06 00:57:37 [core.py:390]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 286, in init_vllm_registered_model
ERROR 04-06 00:57:37 [core.py:390]     return _initialize_model(vllm_config=vllm_config, prefix=prefix)
ERROR 04-06 00:57:37 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 127, in _initialize_model
ERROR 04-06 00:57:37 [core.py:390]     return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-06 00:57:37 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 479, in __init__
ERROR 04-06 00:57:37 [core.py:390]     LlamaForCausalLM.__init__(self,
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 486, in __init__
ERROR 04-06 00:57:37 [core.py:390]     self.model = self._init_model(vllm_config=vllm_config,
ERROR 04-06 00:57:37 [core.py:390]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 488, in _init_model
ERROR 04-06 00:57:37 [core.py:390]     return Llama4Model(vllm_config=vllm_config,
ERROR 04-06 00:57:37 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-06 00:57:37 [core.py:390]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 334, in __init__
ERROR 04-06 00:57:37 [core.py:390]     super().__init__(vllm_config=vllm_config,
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in __init__
ERROR 04-06 00:57:37 [core.py:390]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 321, in __init__
ERROR 04-06 00:57:37 [core.py:390]     self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-06 00:57:37 [core.py:390]                                                     ^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 610, in make_layers
ERROR 04-06 00:57:37 [core.py:390]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-06 00:57:37 [core.py:390]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama.py", line 323, in 
ERROR 04-06 00:57:37 [core.py:390]     lambda prefix: layer_type(config=config,
ERROR 04-06 00:57:37 [core.py:390]                    ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 283, in __init__
ERROR 04-06 00:57:37 [core.py:390]     self.feed_forward = Llama4MoE(
ERROR 04-06 00:57:37 [core.py:390]                         ^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/llama4.py", line 73, in __init__
ERROR 04-06 00:57:37 [core.py:390]     self.experts = FusedMoE(
ERROR 04-06 00:57:37 [core.py:390]                    ^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 502, in __init__
ERROR 04-06 00:57:37 [core.py:390]     assert self.quant_method is not None
ERROR 04-06 00:57:37 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-06 00:57:37 [core.py:390] AssertionError
ERROR 04-06 00:57:37 [core.py:390] 
CRITICAL 04-06 00:57:37 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

Is it not possible yet to run 4-bit versions of this model?

Don't think so...

I tried this:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --quantization bitsandbytes \
    --load-format bitsandbytes 
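
For anyone who wants to reproduce this outside the API server, here is a minimal offline-inference sketch with the same settings (assuming vLLM's standard Python LLM entry point, which should go through the same model-loading path):

# Offline equivalent of the serve command above; the flags map onto EngineArgs.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    dtype="bfloat16",
    gpu_memory_utilization=0.95,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)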

It's not working, of course. I would think @Unsloth and @vLLM are working on this right now, hence the "it only works on unsloth at the moment" warning in this repo.

Hello, did you fix this error? I also hit it when trying to run this model with vllm serve Llama-4-Scout-17B-16E-Instruct-unsloth-bnb-4bit/ and got the following:

ERROR 04-09 19:22:27 [core.py:386]   File "/data1/yinjian/python-scripts/vllm/vllm/model_executor/models/llama4.py", line 73, in __init__
ERROR 04-09 19:22:27 [core.py:386]     self.experts = FusedMoE(
ERROR 04-09 19:22:27 [core.py:386]                    ^^^^^^^^^
ERROR 04-09 19:22:27 [core.py:386]   File "/data1/yinjian/python-scripts/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 502, in __init__
ERROR 04-09 19:22:27 [core.py:386]     assert self.quant_method is not None
ERROR 04-09 19:22:27 [core.py:386]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Same here! I think it is a bug in vLLM; shouldn't the assertion be the opposite?

According to this comment, they say they don't support MoE quantization. So I think it should be something like:
assert self.quant_method is None
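
For reference, this is roughly the block around that assert in vLLM 0.8.3 (a paraphrase of fused_moe/layer.py, not a verbatim copy):

# Paraphrase of FusedMoE.__init__ around the failing assert (vLLM 0.8.3).
if quant_config is None:
    # No quantization configured: fall back to the unquantized MoE kernels.
    self.quant_method = UnquantizedFusedMoEMethod()
else:
    # A quantization config is present (bitsandbytes here); ask it for a
    # method that knows how to handle a FusedMoE layer.
    self.quant_method = quant_config.get_quant_method(self, prefix)
# Fires because the bitsandbytes config returns None for FusedMoE layers,
# i.e. it has no MoE quantization support.
assert self.quant_method is not None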

Agree?

This makes sense.
