eaddario posted an update 12 days ago
Model pruning for the masses!

As of release [b5740](https://github.com/ggml-org/llama.cpp/releases/tag/b5740), llama-quantize supports layer pruning via the `--prune-layers` flag!
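
If you want to script it, here's a minimal sketch of how an invocation could look. The file names and layer indices are placeholders, and I'm assuming the flag takes a comma-separated list of layer indices in front of the usual input/output/type arguments:

```python
# Minimal sketch, not a definitive recipe: placeholder file names and layer
# indices, and it assumes --prune-layers accepts a comma-separated list of
# layer indices alongside llama-quantize's usual "input output TYPE" arguments.
import subprocess

layers_to_prune = [24, 25]  # hypothetical choice of layers to drop

subprocess.run(
    [
        "./llama-quantize",
        "--prune-layers", ",".join(str(layer) for layer in layers_to_prune),
        "model-F16.gguf",            # unpruned source model (placeholder name)
        "model-pruned-Q4_K_M.gguf",  # pruned, quantized output (placeholder name)
        "Q4_K_M",                    # quant type used for the comparisons below
    ],
    check=True,
)
```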

Findings so far are that removing one or two layers has only a moderate impact on quality. PPL and KLD suffer quite a lot, as expected since pruning changes the logits distribution, but the drop in inference quality, as reflected by benchmark test scores, is much less pronounced.
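
To make that point concrete, here is a toy numpy sketch of per-token KL divergence (not llama.cpp's implementation, just an illustration with made-up logits): even a modest shift in the logits produces a clearly non-zero KLD, while the top token, and therefore many greedy answers, can stay the same.

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """Per-token KL(P || Q) from raw logits (toy illustration of the KLD metric)."""
    p = np.exp(p_logits - p_logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    log_q = q_logits - q_logits.max(-1, keepdims=True)
    log_q -= np.log(np.exp(log_q).sum(-1, keepdims=True))
    return (p * (np.log(p) - log_q)).sum(-1)

# Made-up logits standing in for "base" vs "pruned" model outputs
base   = np.array([[4.0, 2.0, 0.5]])
pruned = np.array([[3.0, 2.5, 1.0]])

print(kl_divergence(base, pruned))            # > 0: the distributions differ
print(base.argmax(-1) == pruned.argmax(-1))   # True: top-1 token unchanged
```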

For example, comparing the Q4_K_M variants, the average score drop between eaddario/gemma-3-12b-it-pruned-GGUF and eaddario/gemma-3-12b-it-GGUF is < 3% (60.03 vs 61.65). eaddario/Qwen3-30B-A3B-pruned-GGUF vs eaddario/Qwen3-30B-A3B-GGUF behaves similarly, albeit with a slightly higher impact of ~5.5% (54.19 vs 57.36).

These results seem to confirm the findings of Xin Men et al. in ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853).

Another interesting side effect, at least with Qwen3-30B-A3B, is that pruning 3 or more layers makes the model forget English and reply in Chinese, though the answers are still reasonable!

Quite an interesting find, and very similar to what this team reports: https://www.reddit.com/r/LocalLLaMA/comments/1l44lw8/sparse_transformers_run_2x_faster_llm_with_30/