Tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead

#15
by WBB2500 - opened

I am using Qwen3-30B-A3B-UD-Q4_K_XL and wanted to offload all layers to my GPU, but llama.cpp's CUDA backend does not support k-quants for this operation, which I believe only affects the token_embd weights.

You can see the supported quant types here: https://github.com/ggml-org/llama.cpp/blob/8a2afb7520bbc8f9fa1bbe314d5f2807eb0116b2/ggml/src/ggml-cuda/getrows.cu#L156
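The fallback behavior can be sketched roughly as follows. This is a hypothetical Python illustration, not llama.cpp's actual code: the supported set below approximates the non-k-quant types handled by the CUDA get_rows kernel in the linked source, and the function names are invented for this sketch.

```python
# Illustration (assumption): quant types with a CUDA get_rows kernel,
# approximating the switch in ggml-cuda/getrows.cu. K-quants (q4_K etc.)
# are deliberately absent, which is what triggers the CPU fallback.
CUDA_GET_ROWS_TYPES = {"f32", "f16", "q4_0", "q4_1", "q5_0", "q5_1", "q8_0"}

def pick_buffer_type(tensor_name: str, quant_type: str,
                     preferred: str = "CUDA_Host") -> str:
    """Hypothetical helper: keep the tensor on the preferred buffer type
    if its quant type is supported, otherwise fall back to CPU."""
    if quant_type.lower() in CUDA_GET_ROWS_TYPES:
        return preferred
    # Mirrors the warning seen in the verbose llama-cli log.
    print(f"Tensor '{tensor_name}' ({quant_type}) cannot be used with "
          f"preferred buffer type {preferred}, using CPU instead")
    return "CPU"
```

For example, `pick_buffer_type("token_embd.weight", "q4_K")` returns `"CPU"` and prints the warning above, while a `q8_0` tensor would stay on `CUDA_Host`.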

I got the message about the tensor being moved to the CPU by running llama-cli with the -v flag.
