eaddario posted an update 12 days ago
Model pruning for the masses!

As of release [b5740](https://github.com/ggml-org/llama.cpp/releases/tag/b5740), llama-quantize supports layer pruning via the `--prune-layers` flag!
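
If you want to script it, here's a minimal sketch of how an invocation could look. The file names and layer indices are placeholders, and I'm assuming the flag takes a comma-separated list of layer indices in front of the usual input/output/type arguments:

```python
# Minimal sketch, not a definitive recipe: placeholder file names and layer
# indices, and it assumes --prune-layers accepts a comma-separated list of
# layer indices alongside llama-quantize's usual "input output TYPE" arguments.
import subprocess

layers_to_prune = [24, 25]  # hypothetical choice of layers to drop

subprocess.run(
    [
        "./llama-quantize",
        "--prune-layers", ",".join(str(layer) for layer in layers_to_prune),
        "model-F16.gguf",            # unpruned source model (placeholder name)
        "model-pruned-Q4_K_M.gguf",  # pruned, quantized output (placeholder name)
        "Q4_K_M",                    # quant type used for the comparisons below
    ],
    check=True,
)
```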

Findings so far are that removing one or two layers has only a moderate impact on quality. PPL and KLD suffer quite a lot, as expected since pruning changes the logits distribution, but the drop in inference quality, as reflected by benchmark test scores, is much less pronounced.
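
To make that point concrete, here is a toy numpy sketch of per-token KL divergence (not llama.cpp's implementation, just an illustration with made-up logits): even a modest shift in the logits produces a clearly non-zero KLD, while the top token, and therefore many greedy answers, can stay the same.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """Per-token KL(P || Q) from raw logits (toy illustration of the KLD metric)."""
    p = np.exp(p_logits - p_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    log_q = q_logits - q_logits.max(-1, keepdims=True)
    log_q -= np.log(np.exp(log_q).sum(-1, keepdims=True))
    return (p * (np.log(p) - log_q)).sum(-1)

# Made-up logits standing in for "base" vs "pruned" model outputs
base   = np.array([[4.0, 2.0, 0.5]])
pruned = np.array([[3.0, 2.5, 1.0]])

print(kl_divergence(base, pruned))            # > 0: the distributions differ
print(base.argmax(-1) == pruned.argmax(-1))   # True: top-1 token unchanged
```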

For example, comparing the Q4_K_M variants, the average score drop between eaddario/gemma-3-12b-it-pruned-GGUF and eaddario/gemma-3-12b-it-GGUF is < 3% (60.03 vs 61.65). eaddario/Qwen3-30B-A3B-pruned-GGUF vs eaddario/Qwen3-30B-A3B-GGUF behaves similarly, albeit with a slightly higher impact of ~5.5% (54.19 vs 57.36).

These results seem to confirm the findings of Xin Men et al. in ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853).

Another interesting side effect, at least with Qwen3-30B-A3B, is that pruning 3 or more layers makes the model forget English and reply in Chinese, though the answers are still reasonable!

Quite an interesting find, and very similar to what this team reports: https://www.reddit.com/r/LocalLLaMA/comments/1l44lw8/sparse_transformers_run_2x_faster_llm_with_30/