Cannot offload token_embd layers to CUDA

#4
by WBB2500 - opened

This model has an issue similar to the one I mentioned here: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/discussions/15

While token_embd.weight is only 420 MiB, per_layer_token_embd.weight is 1837 MiB, which is a significant amount to keep in RAM instead of VRAM. I believe quantizing it with a non-K quant type that the CUDA get_rows kernel supports (https://github.com/ggml-org/llama.cpp/blob/8a2afb7520bbc8f9fa1bbe314d5f2807eb0116b2/ggml/src/ggml-cuda/getrows.cu#L156) would allow loading everything onto the GPU.
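For anyone who wants to check their own file: a minimal sketch for listing the embedding tensors and their quant types, using the gguf-py package that ships with llama.cpp (`pip install gguf`). The field names (`tensor_type`, `n_bytes`) assume that package's GGUFReader API, and the filename is just the one from this discussion.

```python
# Sketch: list embedding tensors in the GGUF with their quant types and sizes,
# to see whether per_layer_token_embd.weight ended up as a K-quant.
from gguf import GGUFReader

reader = GGUFReader("gemma-3n-E4B-it-UD-Q4_K_XL.gguf")
for tensor in reader.tensors:
    if "token_embd" in tensor.name:
        size_mib = tensor.n_bytes / (1024 * 1024)
        print(f"{tensor.name}: type={tensor.tensor_type.name}, {size_mib:.0f} MiB")
```

If that prints a K-quant type for per_layer_token_embd.weight, it would line up with the get_rows.cu limitation linked above.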


I am using gemma-3n-E4B-it-UD-Q4_K_XL.gguf with the latest llama.cpp commit 8846aace4934ad29651ea61b8c7e3f6b0556e3d2 (gemma3n text-only support).
