Cannot offload token_embd layers to CUDA
#4 · opened by WBB2500
This model has a similar issue to the one I mentioned here: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/discussions/15
While token_embd.weight is only 420 MiB, per_layer_token_embd.weight is 1837 MiB, which is a significant amount to keep in RAM instead of VRAM. I believe quantizing these tensors with a non-K quant that the CUDA get_rows kernel supports (https://github.com/ggml-org/llama.cpp/blob/8a2afb7520bbc8f9fa1bbe314d5f2807eb0116b2/ggml/src/ggml-cuda/getrows.cu#L156) would allow everything to be loaded onto the GPU.
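For reference, here is a rough sketch of how I read the type dispatch around that line: the helper name and exact case list below are mine, not the real code, but the idea is that only F32/F16 and the non-K quants have a CUDA get_rows kernel, so a K-quantized embedding tensor falls through to the unsupported path and stays on the CPU.

```cpp
// Illustrative only: mimics the kind of type check in ggml-cuda/getrows.cu.
// The linked source is authoritative; this just shows why Q4_K/Q6_K embedding
// tensors cannot be offloaded while Q8_0 ones can.
#include <cstdio>
#include "ggml.h"

static bool get_rows_supported_on_cuda(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_F32:
        case GGML_TYPE_F16:
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_1:
        case GGML_TYPE_Q5_0:
        case GGML_TYPE_Q5_1:
        case GGML_TYPE_Q8_0:
            return true;   // non-K quants: a CUDA get_rows kernel exists
        default:
            return false;  // K quants (Q4_K, Q6_K, ...): no kernel, falls back to CPU
    }
}

int main() {
    // e.g. per_layer_token_embd quantized as Q4_K vs. Q8_0
    printf("Q4_K offloadable: %d\n", get_rows_supported_on_cuda(GGML_TYPE_Q4_K));
    printf("Q8_0 offloadable: %d\n", get_rows_supported_on_cuda(GGML_TYPE_Q8_0));
    return 0;
}
```

If that reading is right, requantizing just the embedding tensors to something like Q8_0 (llama-quantize has a --token-embedding-type option, though I'm not sure it also covers per_layer_token_embd) might be enough to let them offload.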
I am using gemma-3n-E4B-it-UD-Q4_K_XL.gguf with the latest llama.cpp commit 8846aace4934ad29651ea61b8c7e3f6b0556e3d2 (gemma3n text-only support).