Cannot offload token_embd layers to CUDA

#4
by WBB2500 - opened

This model has an issue similar to the one I mentioned here: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/discussions/15

While token_embd.weight is only 420 MiB, per_layer_token_embd.weight is 1837 MiB, which is a significant amount to keep in RAM instead of VRAM. I believe quantizing it with a non-K quant type that the CUDA get_rows kernel supports (https://github.com/ggml-org/llama.cpp/blob/8a2afb7520bbc8f9fa1bbe314d5f2807eb0116b2/ggml/src/ggml-cuda/getrows.cu#L156) would allow loading everything onto the GPU.
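For anyone who wants to check their own file: a minimal sketch for listing the embedding tensors and their quant types, using the gguf-py package that ships with llama.cpp (`pip install gguf`). The field names (`tensor_type`, `n_bytes`) assume that package's GGUFReader API, and the filename is just the one from this discussion.

```python
# Sketch: list embedding tensors in the GGUF with their quant types and sizes,
# to see whether per_layer_token_embd.weight ended up as a K-quant.
from gguf import GGUFReader

reader = GGUFReader("gemma-3n-E4B-it-UD-Q4_K_XL.gguf")
for tensor in reader.tensors:
    if "token_embd" in tensor.name:
        size_mib = tensor.n_bytes / (1024 * 1024)
        print(f"{tensor.name}: type={tensor.tensor_type.name}, {size_mib:.0f} MiB")
```

If that prints a K-quant type for per_layer_token_embd.weight, it would line up with the get_rows.cu limitation linked above.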


I am using gemma-3n-E4B-it-UD-Q4_K_XL.gguf with the latest llama.cpp commit 8846aace4934ad29651ea61b8c7e3f6b0556e3d2 (gemma3n text-only support).
