Jaward posted an update Aug 17
Supercool Weekend Read🤖
Nvidia researchers achieved SOTA LLM compression results using structured pruning and knowledge distillation.

Details on Techniques (Simplified):
They started off with a large pre-trained language model (15B params), then:

1. Estimated the importance of different parts of the model (neurons, attention heads, layers) using activation-based metrics on a small calibration dataset.

2. Pruned (removed) the less important parts of the model to reduce its size (steps 1-2 are sketched in code after the list).

3. Retrained the pruned model using knowledge distillation, where the original large model acts as a teacher for the smaller pruned model (sketched in code after the list).

4. Used a lightweight neural architecture search to find the best configuration for the pruned model (steps 4-5 are sketched in code after the list).

5. Repeated this process iteratively to create even smaller models.
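
To make the pipeline concrete, here is a toy sketch of steps 1-2 on a single MLP block: neurons are scored by their mean activation magnitude on a small calibration batch, then the weakest ones are sliced away. The layer sizes, random calibration data, and 50% keep ratio are made-up illustrations, not the paper's settings.

```python
# Toy sketch of steps 1-2: activation-based importance scoring + structured pruning.
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden, inner = 64, 256              # toy dimensions (assumption)
mlp = nn.Sequential(
    nn.Linear(hidden, inner),        # up-projection
    nn.ReLU(),
    nn.Linear(inner, hidden),        # down-projection
)

# 1) Importance estimation: record inner-layer activations on calibration data.
acts = []
hook = mlp[1].register_forward_hook(lambda m, i, o: acts.append(o.detach()))
calib = torch.randn(512, hidden)     # stand-in for a small calibration dataset
with torch.no_grad():
    mlp(calib)
hook.remove()

importance = torch.cat(acts).abs().mean(dim=0)   # one score per inner neuron

# 2) Structured pruning: keep the top 50% most important neurons (toy ratio).
keep = importance.topk(inner // 2).indices.sort().values
pruned = nn.Sequential(
    nn.Linear(hidden, len(keep)),
    nn.ReLU(),
    nn.Linear(len(keep), hidden),
)
with torch.no_grad():
    pruned[0].weight.copy_(mlp[0].weight[keep])
    pruned[0].bias.copy_(mlp[0].bias[keep])
    pruned[2].weight.copy_(mlp[2].weight[:, keep])
    pruned[2].bias.copy_(mlp[2].bias)

print("inner width:", inner, "->", len(keep))
```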
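
Step 3 is logit distillation; below is a minimal sketch of the loss, assuming the usual Hinton-style softened-KL formulation (the paper combines several distillation losses, which are omitted here).

```python
# Toy sketch of step 3: KL divergence between softened teacher and student logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Hinton-style distillation: KL between temperature-softened distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean reduction plus T^2 scaling is the standard convention
    return F.kl_div(s, t, reduction="batchmean") * temperature**2

# Toy usage: (batch, vocab) logits from a frozen teacher and a trainable student.
teacher_logits = torch.randn(4, 32000)
student_logits = torch.randn(4, 32000, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```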
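
And a toy sketch of steps 4-5: a tiny search over candidate widths for the same kind of block, ranked by how closely each pruned block matches the original on calibration data, with the winner feeding the next prune-and-distill round. The candidate widths and error metric are illustrative assumptions, not the paper's search space.

```python
# Toy sketch of steps 4-5: lightweight width search + the outer iterative loop.
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden, inner = 64, 256
mlp = nn.Sequential(nn.Linear(hidden, inner), nn.ReLU(), nn.Linear(inner, hidden))
calib = torch.randn(512, hidden)

def prune_width(model, width, data):
    """Activation-ranked width pruning of a Linear-ReLU-Linear block (toy helper)."""
    acts = []
    h = model[1].register_forward_hook(lambda m, i, o: acts.append(o.detach()))
    with torch.no_grad():
        model(data)
    h.remove()
    keep = torch.cat(acts).abs().mean(0).topk(width).indices.sort().values
    small = nn.Sequential(nn.Linear(model[0].in_features, width), nn.ReLU(),
                          nn.Linear(width, model[2].out_features))
    with torch.no_grad():
        small[0].weight.copy_(model[0].weight[keep])
        small[0].bias.copy_(model[0].bias[keep])
        small[2].weight.copy_(model[2].weight[:, keep])
        small[2].bias.copy_(model[2].bias)
    return small

# 4) Lightweight search: try a few candidate widths, rank by calibration error.
with torch.no_grad():
    reference = mlp(calib)
    candidates = [192, 128, 96, 64]
    scores = {w: (prune_width(mlp, w, calib)(calib) - reference).pow(2).mean().item()
              for w in candidates}
best = min(candidates, key=lambda w: scores[w])
print("best width:", best, "errors:", scores)

# 5) Iterate: the chosen pruned model would be distilled (step 3) and can then
#    serve as the starting point for another prune-and-distill round.
```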

Cool, giving it a try this weekend 😎
Code: https://github.com/NVlabs/Minitron
Paper: https://arxiv.org/abs/2407.14679
Demo: nvidia/minitron