Request triggers error message: The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1
#2 by aldettinger - opened
Hello granite community. I'm trying to experiment with the uncertainty LoRA adapter but am facing an issue.
The following error message is generated: The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1.
Note that I'm able to query the base model granite-3.2-8b-instruct when it is served alone; however, the error occurs when I add the uncertainty LoRA adapter.
More details on my setup below:
I've built the Docker image needed to run vLLM on CPU only (no GPUs), as follows:
cd ~/dev/projects/vllm-upstream
git checkout v0.8.2
docker build -f Dockerfile.cpu -t vllm-0.8.2-cpu-env --shm-size=4g .
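Before serving, I also checked the downloaded adapter directory on the host (the same directory that gets mounted into the container below); assuming it follows the standard PEFT layout, it should contain at least adapter_config.json and the adapter weights:
# expected: adapter_config.json plus adapter_model.safetensors (or adapter_model.bin)
ls /home/user/dev/hugging-face-models/granite-uncertainty-3.2-8b-lora
# the adapter's target modules and rank are defined here
cat /home/user/dev/hugging-face-models/granite-uncertainty-3.2-8b-lora/adapter_config.json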
From there, I'm able to serve the local base model and LoRA adapter that I previously downloaded:
docker run -it --mount type=bind,source=/home/user/dev/hugging-face-models/,target=/hf-models --rm --network=host vllm-0.8.2-cpu-env --model /hf-models/granite-3.2-8b-instruct --max-model-len 16384 --enable-lora --lora-modules uncertainty-lora=/hf-models/granite-uncertainty-3.2-8b-lora
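Once the server is up, the models endpoint (one of the routes listed in the server log below) can be used to confirm that both the base model and the adapter are registered; I would expect it to list /hf-models/granite-3.2-8b-instruct as well as uncertainty-lora:
# list the models served by this process (the pipe to json.tool is only for readability)
curl -s http://localhost:8000/v1/models | python3 -m json.tool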
I then try to invoke the model through the LoRA adapter:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "uncertainty-lora",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'
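For comparison, the same request can be addressed to the base model name served by the same process (as noted above, the base model answers fine when served without the adapter), which helps isolate the problem to the LoRA path:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "/hf-models/granite-3.2-8b-instruct",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
]
}'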
When the request targets the uncertainty-lora adapter, however, vLLM crashes with the error below:
[main_upstream @ uncertainty-quarkus-experiments]$ docker run -it --mount type=bind,source=/home/user/dev/hugging-face-models/,target=/hf-models --rm --network=host vllm-0.8.2-cpu-env --model /hf-models/granite-3.2-8b-instruct --max-model-len 16384 --enable-lora --lora-modules uncertainty-lora=/hf-models/granite-uncertainty-3.2-8b-lora
INFO 04-03 13:42:04 [__init__.py:239] Automatically detected platform cpu.
INFO 04-03 13:42:05 [api_server.py:981] vLLM API server version 0.8.2
INFO 04-03 13:42:05 [api_server.py:982] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='uncertainty-lora', path='/hf-models/granite-uncertainty-3.2-8b-lora', base_model_name=None)], prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/hf-models/granite-3.2-8b-instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=16384, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=True, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, 
compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
INFO 04-03 13:42:08 [config.py:585] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
WARNING 04-03 13:42:08 [_logger.py:72] device type=cpu is not supported by the V1 Engine. Falling back to V0.
WARNING 04-03 13:42:08 [_logger.py:72] Environment variable VLLM_CPU_KVCACHE_SPACE (GiB) for CPU backend is not set, using 4 by default.
WARNING 04-03 13:42:08 [_logger.py:72] uni is not supported on CPU, fallback to mp distributed executor backend.
INFO 04-03 13:42:08 [api_server.py:241] Started engine process with PID 72
INFO 04-03 13:42:11 [__init__.py:239] Automatically detected platform cpu.
INFO 04-03 13:42:11 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='/hf-models/granite-3.2-8b-instruct', speculative_config=None, tokenizer='/hf-models/granite-3.2-8b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/hf-models/granite-3.2-8b-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 04-03 13:42:12 [cpu.py:40] Using Torch SDPA backend.
INFO 04-03 13:42:12 [importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
[W403 13:42:12.399087772 ProcessGroupGloo.cpp:727] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
INFO 04-03 13:42:13 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 9.39it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:00<00:00, 4.18it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:00<00:00, 3.34it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00, 3.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00, 3.38it/s]
INFO 04-03 13:42:14 [loader.py:447] Loading weights took 1.24 seconds
INFO 04-03 13:42:14 [punica_selector.py:18] Using PunicaWrapperCPU.
INFO 04-03 13:42:14 [executor_base.py:111] # cpu blocks: 1638, # CPU blocks: 0
INFO 04-03 13:42:14 [executor_base.py:116] Maximum concurrency for 16384 tokens per request: 1.60x
INFO 04-03 13:42:14 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 0.35 seconds
WARNING 04-03 13:42:15 [_logger.py:72] Pin memory is not supported on CPU.
INFO 04-03 13:42:15 [serving_models.py:174] Loaded new LoRA adapter: name 'uncertainty-lora', path '/hf-models/granite-uncertainty-3.2-8b-lora'
INFO 04-03 13:42:15 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-03 13:42:15 [launcher.py:26] Available routes are:
INFO 04-03 13:42:15 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-03 13:42:15 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-03 13:42:15 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-03 13:42:15 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-03 13:42:15 [launcher.py:34] Route: /health, Methods: GET
INFO 04-03 13:42:15 [launcher.py:34] Route: /load, Methods: GET
INFO 04-03 13:42:15 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 04-03 13:42:15 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-03 13:42:15 [launcher.py:34] Route: /version, Methods: GET
INFO 04-03 13:42:15 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /score, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-03 13:42:15 [launcher.py:34] Route: /invocations, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
WARNING 04-03 13:44:42 [tokenizer.py:281] No tokenizer found in /hf-models/granite-uncertainty-3.2-8b-lora, using base model tokenizer instead. (Exception: <class 'transformers.models.granite.configuration_granite.GraniteConfig'>)
INFO 04-03 13:44:43 [chat_utils.py:379] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 04-03 13:44:43 [logger.py:39] Received request chatcmpl-36cca86dc3814905ab7b48c9c49603dd: prompt: "<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.\nToday's Date: April 03, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>What is the capital of France?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16318, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: LoRARequest(lora_name='uncertainty-lora', lora_int_id=1, lora_path='/hf-models/granite-uncertainty-3.2-8b-lora', lora_local_path=None, long_lora_max_len=None, base_model_name=None), prompt_adapter_request: None.
WARNING 04-03 13:44:43 [tokenizer.py:281] No tokenizer found in /hf-models/granite-uncertainty-3.2-8b-lora, using base model tokenizer instead. (Exception: <class 'transformers.models.granite.configuration_granite.GraniteConfig'>)
INFO 04-03 13:44:43 [engine.py:310] Added request chatcmpl-36cca86dc3814905ab7b48c9c49603dd.
/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py:1128: UserWarning: An output with one or more elements was resized since it had shape [1, 256, 1], which does not match the required output shape [256, 1]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at /pytorch/aten/src/ATen/native/Resize.cpp:30.)
torch.matmul(self.embeddings_tensors,
CRITICAL 04-03 13:44:50 [launcher.py:116] MQLLMEngine is already dead, terminating server process
INFO: 127.0.0.1:39390 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR 04-03 13:44:50 [engine.py:160] RuntimeError('The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1')
ERROR 04-03 13:44:50 [engine.py:160] Traceback (most recent call last):
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 158, in start
ERROR 04-03 13:44:50 [engine.py:160] self.run_engine_loop()
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 221, in run_engine_loop
ERROR 04-03 13:44:50 [engine.py:160] request_outputs = self.engine_step()
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 247, in engine_step
ERROR 04-03 13:44:50 [engine.py:160] raise e
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 230, in engine_step
ERROR 04-03 13:44:50 [engine.py:160] return self.engine.step()
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1434, in step
ERROR 04-03 13:44:50 [engine.py:160] outputs = self.model_executor.execute_model(
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 284, in execute_model
ERROR 04-03 13:44:50 [engine.py:160] driver_outputs = self._driver_execute_model(execute_model_req)
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/mp_distributed_executor.py", line 144, in _driver_execute_model
ERROR 04-03 13:44:50 [engine.py:160] return self.driver_worker.execute_model(execute_model_req)
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 420, in execute_model
ERROR 04-03 13:44:50 [engine.py:160] output = self.model_runner.execute_model(
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-03 13:44:50 [engine.py:160] return func(*args, **kwargs)
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 663, in execute_model
ERROR 04-03 13:44:50 [engine.py:160] logits = self.model.compute_logits(hidden_states,
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/granite.py", line 403, in compute_logits
ERROR 04-03 13:44:50 [engine.py:160] logits = self.logits_processor(self.lm_head, hidden_states,
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-03 13:44:50 [engine.py:160] return self._call_impl(*args, **kwargs)
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-03 13:44:50 [engine.py:160] return forward_call(*args, **kwargs)
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 1159, in forward
ERROR 04-03 13:44:50 [engine.py:160] return type(self.base_layer).forward(self, *args, **kwargs)
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/logits_processor.py", line 70, in forward
ERROR 04-03 13:44:50 [engine.py:160] logits = self._get_logits(hidden_states, lm_head, embedding_bias)
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/layers.py", line 1150, in _get_logits
ERROR 04-03 13:44:50 [engine.py:160] self.punica_wrapper.add_lora_logits(logits, hidden_states,
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/punica_wrapper/punica_cpu.py", line 343, in add_lora_logits
ERROR 04-03 13:44:50 [engine.py:160] bgmv_expand(buffer,
ERROR 04-03 13:44:50 [engine.py:160] File "/usr/local/lib/python3.10/dist-packages/vllm/lora/ops/torch_ops/lora_ops.py", line 40, in bgmv_expand
ERROR 04-03 13:44:50 [engine.py:160] output_tensor[:, :outputs.shape[1]] += outputs[:limit, :]
ERROR 04-03 13:44:50 [engine.py:160] RuntimeError: The size of tensor a (49472) must match the size of tensor b (49664) at non-singleton dimension 1
Does anyone have any advice about this issue? Any solution or workaround?