---
tags:
  - sparse
  - fp8
  - vllm
---

# Meta-Llama-3-8B-pruned_50.2of4-FP8

This repo contains model files for a 2:4 (N:M) sparse Meta-Llama-3-8B model pruned in one-shot with SparseGPT and then retrained with SquareHead knowledge distillation while maintaining the 2:4 sparsity mask. It was subsequently quantized with AutoFP8 to FP8 weights and activations with per-tensor scales, calibrated on UltraChat2k.
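
The quantization step can be sketched with a short AutoFP8 script. The snippet below is a minimal sketch, assuming the `AutoFP8ForCausalLM`/`BaseQuantizeConfig` API from the AutoFP8 README; the local model path and the calibration prompt are hypothetical placeholders (the released model was calibrated on UltraChat2k).

```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

# Hypothetical local path to the unquantized 2:4 sparse checkpoint.
pretrained_model_dir = "Meta-Llama-3-8B-pruned_50.2of4"
quantized_model_dir = "Meta-Llama-3-8B-pruned_50.2of4-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
# Placeholder calibration prompt; the released model used UltraChat2k.
examples = tokenizer(
    ["A short, representative calibration prompt."],
    return_tensors="pt",
).to("cuda")

# Static activation scheme -> per-tensor FP8 scales for weights and activations.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```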

Note: The unquantized Meta-Llama-3-8B-pruned_50.2of4 model is still a work in progress and subject to change. This FP8 model will be updated once the unquantized model is updated.
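
The checkpoint is intended for inference with vLLM, which picks up the FP8 quantization settings from the model config. A minimal sketch follows; the `neuralmagic/...` repo id is an assumption, so substitute the actual Hugging Face id of this repository.

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id; replace with the actual id of this repository.
model_id = "neuralmagic/Meta-Llama-3-8B-pruned_50.2of4-FP8"

# vLLM reads the FP8 quantization settings from the checkpoint's config.
llm = LLM(model=model_id)
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["The benefits of 2:4 sparsity are"], params)
print(outputs[0].outputs[0].text)
```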

## Evaluation Benchmark Results

Model evaluation results were obtained via lm-evaluation-harness, following the Open LLM Leaderboard configuration.

| Benchmark | Meta-Llama-3-8B | Meta-Llama-3-8B-pruned_50.2of4 | Meta-Llama-3-8B-pruned_50.2of4-FP8<br>(this model) |
| :--- | :---: | :---: | :---: |
| ARC-c (25-shot) | 59.47% | 57.76% | 58.02% |
| MMLU (5-shot) | 65.29% | 60.44% | 60.71% |
| HellaSwag (10-shot) | 82.14% | 79.97% | 79.61% |
| WinoGrande (5-shot) | 77.27% | 77.19% | 76.32% |
| GSM8K (5-shot) | 44.81% | 47.92% | 49.36% |
| TruthfulQA (0-shot) | 43.96% | 41.02% | 40.82% |
| Average Accuracy | 62.16% | 60.72% | 60.81% |
| Recovery | 100% | 97.68% | 97.83% |
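
These results can be approximately reproduced with lm-evaluation-harness. The sketch below uses its `lm_eval.simple_evaluate` Python API with the few-shot settings from the table; the task identifiers and the model repo id are assumptions and may differ from the exact Open LLM Leaderboard task variants.

```python
import lm_eval

# Hypothetical repo id; replace with the actual id of this repository.
model_args = "pretrained=neuralmagic/Meta-Llama-3-8B-pruned_50.2of4-FP8,dtype=auto"

# One (task, num_fewshot) pair per row of the table above.
benchmarks = [
    ("arc_challenge", 25),
    ("mmlu", 5),
    ("hellaswag", 10),
    ("winogrande", 5),
    ("gsm8k", 5),
    ("truthfulqa_mc2", 0),
]

for task, shots in benchmarks:
    results = lm_eval.simple_evaluate(
        model="vllm",          # or "hf" to run without vLLM
        model_args=model_args,
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```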

## Help

For further support, and discussions on these models and AI in general, join Neural Magic's Slack Community.