Tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
#15 · opened by WBB2500
I am using Qwen3-30B-A3B-UD-Q4_K_XL and wanted to offload all layers to my GPU, but llama.cpp's CUDA backend does not support k-quants for the row-lookup (get_rows) operation, which I believe affects only the token_embd weights.
You can see the supported quant types here: https://github.com/ggml-org/llama.cpp/blob/8a2afb7520bbc8f9fa1bbe314d5f2807eb0116b2/ggml/src/ggml-cuda/getrows.cu#L156
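For context, the dispatch at that line is roughly shaped like the sketch below. This is a simplified illustration, not the verbatim source; the exact set of supported types may differ between commits, so check the linked line for the current list:

```cpp
#include "ggml.h"

// Simplified sketch of the type dispatch in ggml-cuda/getrows.cu.
// get_rows is the op used to look up rows of token_embd.weight, so any
// quant type missing from this switch cannot live in a CUDA buffer and
// the scheduler falls back to a CPU buffer for that tensor.
static void get_rows_cuda_sketch(const ggml_tensor * src0 /* , ... */) {
    switch (src0->type) {
        case GGML_TYPE_F16:
        case GGML_TYPE_F32:
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_Q5_0:
        case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
            // launch the matching dequantize + gather kernel
            break;
        default:
            // k-quants (e.g. q4_K) land here: unsupported by this op,
            // hence the "using CPU instead" message in the title
            GGML_ABORT("unsupported type for get_rows");
    }
}
```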
I got the message about the tensor being moved to CPU by running llama-cli with the -v flag.
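For anyone wanting to reproduce, a command along these lines should surface the same message (the model path and prompt are placeholders for your own):

```sh
llama-cli -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf -ngl 99 -v -p "hello"
```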