# SparseLlama-3-8B-pruned_50.2of4-FP8
This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model, pruned in one-shot with SparseGPT and then retrained with SquareHead knowledge distillation while maintaining the 2:4 sparsity mask. It was subsequently quantized with AutoFP8 to FP8 weights and activations using per-tensor scales, calibrated on UltraChat2k.
Note: The unquantized SparseLlama-3-8B-pruned_50.2of4 model is still a work in progress and subject to change. This FP8 model will be updated whenever the unquantized model is updated.
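
The quantization step can be reproduced with AutoFP8 roughly as sketched below. This is a minimal sketch, not the exact production recipe: the Hub repo id for the unquantized sparse model, the calibration dataset id (`HuggingFaceH4/ultrachat_200k`), the sample count, and the sequence length are assumptions standing in for the UltraChat2k calibration set mentioned above.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Assumed repo id for the unquantized 2:4 sparse model.
pretrained_model_dir = "neuralmagic/SparseLlama-3-8B-pruned_50.2of4"
quantized_model_dir = "SparseLlama-3-8B-pruned_50.2of4-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 ships without a pad token

# Assumed calibration data: a small UltraChat sample standing in for "UltraChat2k".
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(512))
texts = [tokenizer.apply_chat_template(row["messages"], tokenize=False) for row in ds]
examples = tokenizer(
    texts, padding=True, truncation=True, max_length=2048, return_tensors="pt"
).to("cuda")

# Static activation scheme -> per-tensor scales computed from the calibration set.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

Because the 2:4 sparsity mask is already baked into the weights, FP8 quantization only rescales the surviving values; the mask itself is untouched.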
## Evaluation Benchmark Results
Model evaluation results were obtained via lm-evaluation-harness, following the configuration of the Open LLM Leaderboard (see the example invocation after the table below).
| Benchmark | Meta-Llama-3-8B | SparseLlama-3-8B-pruned_50.2of4 | SparseLlama-3-8B-pruned_50.2of4-FP8 (this model) |
| --- | --- | --- | --- |
| ARC-c (25-shot) | 59.47% | 57.76% | 58.02% |
| MMLU (5-shot) | 65.29% | 60.44% | 60.71% |
| HellaSwag (10-shot) | 82.14% | 79.97% | 79.61% |
| WinoGrande (5-shot) | 77.27% | 77.19% | 76.32% |
| GSM8K (5-shot) | 44.81% | 47.92% | 49.36% |
| TruthfulQA (0-shot) | 43.96% | 41.02% | 40.82% |
| **Average Accuracy** | 62.16% | 60.72% | 60.81% |
| **Recovery** | 100% | 97.68% | 97.83% |
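
A single benchmark from the table can be reproduced with the lm-evaluation-harness Python API roughly as follows. This is a minimal sketch: the repo id, the vLLM backend arguments, and the task/few-shot settings are assumptions, and the exact Open LLM Leaderboard task variants and metrics (e.g. `acc_norm` for ARC-c) may differ from the defaults shown here.

```python
import lm_eval

# Assumed Hub path to this FP8 checkpoint.
model_id = "neuralmagic/SparseLlama-3-8B-pruned_50.2of4-FP8"

# Example: ARC-Challenge with 25-shot prompting, matching the first row of the table.
results = lm_eval.simple_evaluate(
    model="vllm",  # vLLM backend can load FP8 checkpoints; "hf" works for the dense baseline
    model_args=f"pretrained={model_id},dtype=auto,gpu_memory_utilization=0.8",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])
```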
## Help
For further support, and for discussions on these models and AI in general, join Neural Magic's Slack Community.