ValueError: Error in model execution: Attempted to assign 2929 = 2929 multimodal tokens to 0 placeholders
Hi,
I am trying to serve this model with vLLM and query it via the API, but the request fails as shown below.
Strangely, the vLLM server treats the user prompt as empty even though I sent both a text prompt and an image_url; the rendered prompt shows an empty user turn:
prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|im_end|>\n<|im_start|>assistant\n'
Here are the packages I've installed:
pip list | grep -E "vllm|torch|transformer"
torch 2.4.0+cu121
torchaudio 2.4.0+cu121
torchvision 0.19.0+cu121
transformers 4.45.2
vllm 0.6.3
Below is the curl command that sends a text prompt plus an image URL, followed by the vLLM log dump before the server terminates.
# call API via curl
curl -X POST "http://localhost:33380/v1/chat/completions" \
> -H "Content-Type: application/json" \
> --data '{
> "model": "VARCO-VISION-14B-HF",
> "messages": [
> {
> "role": "user",
> "content": [
> {
> "type": "text",
> "text": "Describe this image in one sentence."
> },
> {
> "type": "image_url",
> "image_url": {
> "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
> }
> }
> ]
> }
> ]
> }'
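For reference, here is an equivalent request via the OpenAI Python client (a sketch; it assumes the openai package is installed and uses the same port as above):

# Equivalent request using the OpenAI-compatible API that vLLM serves.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:33380/v1", api_key="EMPTY")  # vLLM accepts any key string

response = client.chat.completions.create(
    model="VARCO-VISION-14B-HF",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)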
# vllm log dump
INFO 12-03 19:10:06 logger.py:37] Received request chat-be36835baa7d434e80800c724f400ce7: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32749, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 151645, 198, 151644, 77091, 198], lora_request: None, prompt_adapter_request: None.
INFO 12-03 19:10:06 engine.py:292] Added request chat-be36835baa7d434e80800c724f400ce7.
INFO 12-03 19:10:07 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241203-191007.pkl...
WARNING 12-03 19:10:07 model_runner_base.py:143] Failed to pickle inputs of failed execution: Can't pickle local object 'weak_bind.<locals>.weak_bound'
INFO: 127.0.0.1:43838 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
ERROR 12-03 19:10:07 engine.py:160] ValueError('Error in model execution: Attempted to assign 2929 = 2929 multimodal tokens to 0 placeholders')
ERROR 12-03 19:10:07 engine.py:160] Traceback (most recent call last):
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 12-03 19:10:07 engine.py:160] return func(*args, **kwargs)
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1665, in execute_model
ERROR 12-03 19:10:07 engine.py:160] hidden_or_intermediate_states = model_executable(
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
ERROR 12-03 19:10:07 engine.py:160] return self._call_impl(*args, **kwargs)
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
ERROR 12-03 19:10:07 engine.py:160] return forward_call(*args, **kwargs)
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/model_executor/models/llava_onevision.py", line 836, in forward
ERROR 12-03 19:10:07 engine.py:160] inputs_embeds = merge_multimodal_embeddings(
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 304, in merge_multimodal_embeddings
ERROR 12-03 19:10:07 engine.py:160] raise ValueError(
ERROR 12-03 19:10:07 engine.py:160] ValueError: Attempted to assign 2929 = 2929 multimodal tokens to 0 placeholders
ERROR 12-03 19:10:07 engine.py:160]
ERROR 12-03 19:10:07 engine.py:160] The above exception was the direct cause of the following exception:
ERROR 12-03 19:10:07 engine.py:160]
ERROR 12-03 19:10:07 engine.py:160] Traceback (most recent call last):
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
ERROR 12-03 19:10:07 engine.py:160] self.run_engine_loop()
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
ERROR 12-03 19:10:07 engine.py:160] request_outputs = self.engine_step()
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 239, in engine_step
ERROR 12-03 19:10:07 engine.py:160] raise e
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
ERROR 12-03 19:10:07 engine.py:160] return self.engine.step()
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1386, in step
ERROR 12-03 19:10:07 engine.py:160] outputs = self.model_executor.execute_model(
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 134, in execute_model
ERROR 12-03 19:10:07 engine.py:160] output = self.driver_worker.execute_model(execute_model_req)
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 327, in execute_model
ERROR 12-03 19:10:07 engine.py:160] output = self.model_runner.execute_model(
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-03 19:10:07 engine.py:160] return func(*args, **kwargs)
ERROR 12-03 19:10:07 engine.py:160] File "{path}/lib/python3.10/site-packages/vllm/worker/model_runner_base.py", line 146, in _wrapper
ERROR 12-03 19:10:07 engine.py:160] raise type(err)(f"Error in model execution: "
ERROR 12-03 19:10:07 engine.py:160] ValueError: Error in model execution: Attempted to assign 2929 = 2929 multimodal tokens to 0 placeholders
ERROR 12-03 19:10:16 client.py:250] TimeoutError('No heartbeat received from MQLLMEngine')
ERROR 12-03 19:10:16 client.py:250] NoneType: None
Thank you for your inquiry.
We’ve recently updated the chat_template in the tokenizer_config.json of NCSOFT/VARCO-VISION-14B-HF. The empty user turn in your log indicates that the cached template was not rendering list-type (multimodal) message content, so no image placeholder tokens were emitted; that is exactly what the 'Attempted to assign ... multimodal tokens to 0 placeholders' error reports.
To pick up the fix, we recommend downloading the updated tokenizer_config.json, as sketched below.
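For example, a minimal sketch using huggingface_hub (assuming it is installed; force_download bypasses the stale cached copy):

# Refresh only the cached tokenizer_config.json for this repo.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="NCSOFT/VARCO-VISION-14B-HF",
    filename="tokenizer_config.json",
    force_download=True,  # ignore the cached file and fetch the updated one
)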
Alternatively, you may remove the previously downloaded checkpoint from your cache folder and redownload it.
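After refreshing, you can sanity-check the template before restarting vLLM. A sketch follows; the exact placeholder string is template-dependent, so treat the expected output as an assumption:

# Check that the updated chat template renders multimodal user content.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NCSOFT/VARCO-VISION-14B-HF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}},
        ],
    }
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the user turn should no longer be empty and should include an image placeholder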
If you have any further questions or require assistance, please feel free to reach out.