Model pruning for the masses!
As of version [5740](https://github.com/ggml-org/llama.cpp/releases/tag/b5740), `llama-quantize` now supports layer pruning via the `--prune-layers` flag!
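For reference, a minimal sketch of what an invocation might look like. The exact argument format for `--prune-layers` is an assumption here (a comma-separated list of layer indices), and the file names are placeholders; check `llama-quantize --help` in your build for the precise syntax.

```bash
# Hypothetical example: quantize to Q4_K_M while dropping two layers.
# The "20,21" layer-list syntax is assumed, not confirmed from the release notes.
./llama-quantize \
    --prune-layers 20,21 \
    gemma-3-12b-it-F16.gguf \
    gemma-3-12b-it-pruned-Q4_K_M.gguf \
    Q4_K_M
```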
Findings so far are that removing one or two layers has a relatively moderate impact on quality. PPL and KLD suffer quite a lot, as expected since pruning changes the logits distribution, but the drop in inference quality, as reflected by test scores, is less pronounced.
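If you want to reproduce that kind of PPL/KLD comparison yourself, one way is llama.cpp's `llama-perplexity` tool. The sketch below assumes the `--kl-divergence-base` / `--kl-divergence` options behave as described in that tool's documentation, and the model and corpus file names are placeholders.

```bash
# 1) Save the reference model's logits over a test corpus:
./llama-perplexity -m gemma-3-12b-it-Q4_K_M.gguf -f wiki.test.raw \
    --kl-divergence-base gemma-3-12b-it.logits

# 2) Score the pruned model against those logits (reports PPL and KL divergence):
./llama-perplexity -m gemma-3-12b-it-pruned-Q4_K_M.gguf -f wiki.test.raw \
    --kl-divergence-base gemma-3-12b-it.logits --kl-divergence
```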
For example, using the Q4_K_M variants as a benchmark, the average drop between eaddario/gemma-3-12b-it-pruned-GGUF and eaddario/gemma-3-12b-it-GGUF is < 3% (60.03 vs 61.65). The behaviour is similar for eaddario/Qwen3-30B-A3B-pruned-GGUF and eaddario/Qwen3-30B-A3B-GGUF, albeit with a somewhat higher impact of ~5.5% (54.19 vs 57.36).
These results seem to confirm the findings of Xin Men et al. in ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (arXiv:2403.03853).
Another interesting side effect, at least with Qwen3-30B-A3B, is that pruning three or more layers makes the model forget English and reply in Chinese, though the answers are still reasonable!