---
language:
- en
library_name: transformers
license: llama2
---
# ProSparse-LLaMA-2-7B
- Model creator: [Meta](https://huggingface.co/meta-llama)
- Original model: [Llama 2 7B](https://huggingface.co/meta-llama/Llama-2-7b-hf)
- Fine-tuned by: [THUNLP](https://nlp.csai.tsinghua.edu.cn/) and [ModelBest](modelbest.cn)
### Introduction
The utilization of activation sparsity, namely the existence of considerable weakly-contributed elements among activation outputs, is a promising method for inference acceleration of large language models (LLMs) ([Liu et al., 2023](https://proceedings.mlr.press/v202/liu23am/liu23am.pdf); [Song et al., 2023](https://arxiv.org/pdf/2312.12456.pdf)). Concretely, acceleration methods based on activation sparsity usually achieve higher inference speed by making wiser resource allocation and computation policies to avoid resource waste on these weakly-contributed parameters.
Adopting ReLU as the activation function is a straightforward method to achieve activation sparsity. However, most recent mainstream LLMs adopt activation functions without intrinsic sparsity (e.g., GELU and Swish). Some efforts ([Zhang et al., 2022](https://aclanthology.org/2022.findings-acl.71.pdf); [Mirzadeh et al., 2023](https://arxiv.org/pdf/2310.04564.pdf); [Zhang et al., 2024](https://arxiv.org/pdf/2402.03804.pdf)) introduce ReLU or its variants as the substitutive activation function to help non-ReLU LLMs achieve activation sparsity and inference acceleration, but few can concurrently obtain high sparsity and comparable task-specific performance.
In this work, we introduce a lossless sparsification method named "ProSparse" to push for higher activation sparsity without performance degradation. By applying ProSparse to Swish-activated LLaMA2-7B and LLaMA2-13B, we obtain ReLU-activated LLaMA2 models with high sparsity of 89.32% and 88.80% respectively while their performance is comparable to the original version. Further inference acceleration experiments demonstrate the practical speedup effects of higher sparsity on both [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf) and our two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator).
### Training Dataset
We train the 7B model on about 34.60 billion tokens within 16,500 steps, including a mixture of the following two categories of data.
- Language modeling datasets:
* StarCoder
* Wikipedia
* Pile
* Other collected datasets
- Instruction tuning datasets:
- UltraChat
- P3 (multiple-choice QA)
- PAQ
- Unnatural Instructions
- Flan
- Super-Natural Instructions
- Other collected datasets
Intuitively, training the model with even more tokens or with data of a wider coverage and higher quality will obtain better task-specific performance.
### ProSparse: Training Methodology
The training process of ProSparse consists of three steps (refer to Section 3.2 of [paper](TODO) for more details):
1. **Activation Function Substitution**: We substituting the activation function of FFNs with ReLU and applying continual training;
2. **Progressive Sparsity Regularization**: We jointly optimize the model on the conventional next-token prediction loss and \\(L_1\\) regularization loss. The regularization is applied to the sparse intermediate outputs of FFNs with a regularization factor increasing progressively in multiple stages. Specifically, the regularization factor \\(\lambda\\) is set to a small constant for the warmup stage, and then increases along a smooth sine curve for each of the subsequent incremental stages. Each stage is accompanied by certain steps of training. In this way, the model can have more time to adapt to the increasing regularization without radical activation shifts, thus alleviating performance degradation.
3. **Activation Threshold Shifting**: We finally replace ReLU with FATReLU ([Kurtz et al., 2020](https://proceedings.mlr.press/v119/kurtz20a/kurtz20a.pdf)), a ReLU variant with a positive threshold. This can prune those non-zero weakly-contributed elements in activation outputs and further boost sparsity.
The 7B model is trained on 8 A100 GPUs. The learning rate (LR) is controlled by a cosine scheduler with a peak LR of \\(3e-5\\). The hyper-parameters for each stage (including the regularization factor \\(\lambda_i\\), the accumulated training steps \\(T_i\\), and the accumulated training tokens) are shown as follows:
| Step Number \\(i\\) | \\(\lambda_i\\) | \\(T_i\\) | Accumulated Tokens (B) |
| :-------------: | :---------: | :----: | :--------------------: |
| 0 | 0 | 5,000 | 10.49 |
| 1 | \\(5e-3\\) | 6,000 | 12.58 |
| 2 | \\(5e-2\\) | 10,000 | 20.97 |
| 3 | \\(5e-2\\) | 12,000 | 25.17 |
| 4 | \\(5e-1\\) | 16,000 | 33.55 |
| 5 | \\(5e-1\\) | 16,500 | 34.60 |
### Evaluation Benckmarks
- **Code Generation**: We compute the average pass@1 scores on HumanEval (0-shot) and MBPP (3-shot).
- **Commonsense Reasoning**: We report the average 0-shot perplexity (PPL) on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.
- **Reading Comprehension**: We compute the average 0-shot PPL on BoolQ, 0-shot accuracy on LAMBADA and TyDi QA.
- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and the average PPL on AGI-Eval (0-shot).
### Evaluation Results
The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving high sparsity and comparable performance to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Refer to Section 4.2 of [paper](TODO) for more details.
| Setting | Average
Sparsity | Code
Generation | Commonsense
Reasoning | Reading
Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average |
| :-------------------: | :-----------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: |
| Original-7B | - | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | 37.96 |
| ReluLLaMA-7B | 66.98 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 | 37.62 |
| Vanilla ReLU-7B | 66.04 | 21.31 | 70.73 | 73.22 | 11.22 | 49.22 | 36.11 | 28.01 | 41.40 |
| Shifted ReLU-7B | 69.59 | 20.50 | 70.09 | 73.17 | 13.87 | 48.54 | 35.20 | 27.94 | 41.33 |
| Fixed \\(L_1\\)-7B | 91.46 | 18.85 | 66.01 | 55.39 | 2.27 | 32.28 | 31.40 | 26.48 | 33.24 |
| **ProSparse-7B**\* | 88.11 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 | 38.31 |
| **ProSparse-7B** | 89.32 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 | 38.46 |
| Original-13B | - | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 | 44.06 |
| ReluLLaMA-13B | 71.56 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 | 42.74 |
| **ProSparse-13B**\* | 87.97 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 | 45.07 |
| **ProSparse-13B** | 88.80 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 | 44.90 |
**Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. "ProSparse-7B\*" and "ProSparse-13B\*" denote the ProSparse versions without activation threshold shifting.
### Inference Acceleration Effects
First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework leveraging activation sparsity. As its inference speed and accuracy heavily rely on the performance of activation predictors, we report the activation recall and predicted sparsity (i.e., two key metrics for evaluating the activation predictor) as well as the number of tokens generated per second by PowerInfer (with one A100 GPU and sufficient CPUs).
Moreover, considering the potential inference inaccuracies caused by wrong predictions of activation predictors, we implement two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) for faster accurate inference utilizing activation sparsity. They are responsible for the speedup of two key steps in a gated FFN:
- Step (2) (`S2`): a fused operator of ReLU and \\(\mathbf{s} \odot (\mathbf{x} \mathbf{W}_1^T)\\);
- Step (3) (`S3`): a sparse matrix-vector multiplication operator \\(\mathbf{x}_1 \mathbf{W}_2^T\\).
where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote the gating scores, the FFN input hidden states, the intermediate outputs, and the element-wise multiplication respectively. \\(\mathbf{W}_1\\) and \\(\mathbf{W}_2\\) are FFN weight matrices.
The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, can gain the most benefits among all the settings concerned. Refer to Section 4.3 of [paper](TODO) for more details.
| Setting | Average
Sparsity | Activation
Recall | Predicted
Sparsity | PowerInfer
Speed | `S2`
Time | `S2`
Speedup | `S3`
Time | `S3`
Speedup |
| :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
| ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 67.12 | 1.35 | 63.00 | 1.32 |
| Vanilla ReLU-7B | 66.04 | 87.72 | 72.57 | 12.04 | 67.85 | 1.33 | 63.28 | 1.31 |
| Fixed \\(L_1\\)-7B | 91.46 | 94.51 | 82.85 | 19.62 | 40.99 | 2.21 | 54.19 | 1.53 |
| **ProSparse-7B**\* | 88.11 | 93.46 | 75.24 | 16.30 | 46.66 | 1.94 | 55.56 | 1.49 |
| **ProSparse-7B** | 89.32 | 92.34 | 78.75 | - | 45.38 | 2.00 | 55.05 | 1.51 |
| ReluLLaMA-13B | 71.56 | 86.41 | 71.93 | 6.59 | 69.92 | 1.88 | 75.47 | 1.51 |
| **ProSparse-13B**\* | 87.97 | 91.02 | 77.93 | 8.67 | 55.29 | 2.38 | 67.50 | 1.68 |
| **ProSparse-13B** | 88.80 | 91.11 | 78.28 | - | 53.78 | 2.44 | 66.73 | 1.70 |
**Notes**: Fixed \\(L_1\\) suffers from severe performance degradation. ProSparse with Activation Threshold Shifting is not supported by PowerInfer. "Time" means the average wall-clock time (us) cost by each step with our sparse GPU operators, and "Speedup" is the speedup ratio to the setting without operators. The average time for step (2) and (3) without sparse GPU operators is about **90.55 and 82.92 (us) for 7B, 131.36 and 113.68 (us) for 13B** respectively under all sparsity.
### License Disclaimer
This model is bound by the license & usage restrictions of the original Llama-2 model and comes with no warranty or guarantees of any kind.
### Limitations & Biases
Llama 2 and fine-tuned variants are a new technology that carries risks with use. Testing conducted to date has been in English, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, Llama 2 and any fine-tuned variant's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 2 variants, developers should perform safety testing and tuning tailored to their specific applications of the model.
Please see the Responsible Use Guide available at https://ai.meta.com/llama/responsible-use-guide/
### Citation
Please kindly cite using the following BibTeX:
```bibtex
@article{song2024prosparse,
title={{ProSparse}: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models},
author={Song, Chenyang and Han, Xu and Zhang, Zhengyan and Hu, Shengding and Shi, Xiyu and Li, Kuai and Chen, Chen and Liu, Zhiyuan and Li, Guangli and Yang, Tao and Sun, Maosong},
year={2024},
}
```
#### Acknowledgments
The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B).