Update README.md

README.md (CHANGED)

@@ -6,6 +6,7 @@ tags:
 - MiniCPM
 - ModelBest
 - THUNLP
+license: apache-2.0
 ---
@@ -25,7 +26,7 @@ In this work, we introduce a simple and effective sparsification method named "ProSparse"
 
 ### Training Dataset
 
-We train the 1B model on about 473.02 billion tokens within 101,000 steps. These consist of 35,000 steps for standard ProSparse pre-training,
+We train the 1B model on about 473.02 billion tokens within 101,000 steps. These consist of 35,000 steps for standard ProSparse pre-training, 60,000 steps for decay, and 6,000 steps for SFT. Except for ProSparse, other training settings are highly consistent with the original [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). Refer to our [paper](https://arxiv.org/pdf/2402.13516.pdf) and the [MiniCPM technical report](https://arxiv.org/pdf/2404.06395) for more details.
 
 Intuitively, training the model with even more tokens, or with data of wider coverage and higher quality, will yield better task-specific performance.
@@ -46,8 +47,8 @@ The hyper-parameters for each stage (including the regularization factor \\(\lambda\\)
 | 2 | \\(5e-3\\) | 20,000 | 98.30 |
 | 3 | \\(5e-3\\) | 25,000 | 122.88 |
 | 4 | \\(5e-2\\) | 35,000 | 172.03 |
-| decay | \\(5e-2\\)(fixed) | 95,000 | 466.94 |
-| SFT | \\(1e-2\\)(fixed) | 101,000 | 473.02 |
+| decay | \\(5e-2\\) (fixed) | 95,000 | 466.94 |
+| SFT | \\(1e-2\\) (fixed) | 101,000 | 473.02 |
 
 ### Evaluation Results
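A minimal sketch of how a stage-wise regularization schedule like the one tabulated above can be implemented. The stage boundaries and \\(\lambda\\) values are taken from the visible rows; the flat per-stage lookup and the name `reg_factor` are illustrative assumptions, and stages before stage 2 fall outside this hunk, so they are omitted.

```python
# Stage-wise regularization factor lookup; values mirror the table above.
# Holding lambda flat inside a stage is an assumption, not necessarily the
# paper's exact ramp rule, and earlier stages are omitted here.
SCHEDULE = [
    # (last accumulated step of the stage, lambda)
    (20_000, 5e-3),   # stage 2
    (25_000, 5e-3),   # stage 3
    (35_000, 5e-2),   # stage 4
    (95_000, 5e-2),   # decay stage, lambda fixed
    (101_000, 1e-2),  # SFT stage, lambda fixed
]

def reg_factor(step: int) -> float:
    """Return the sparsity regularization factor at a given training step."""
    for last_step, lam in SCHEDULE:
        if step <= last_step:
            return lam
    return SCHEDULE[-1][1]

# Usage: loss = lm_loss + reg_factor(step) * gate_activations.abs().mean()
```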
@@ -63,19 +64,19 @@ The evaluation results on the above benchmarks demonstrate the advantage of ProSparse
 
 **Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on maximized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
 
-| Setting | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval |
+| Setting | Average<br>Sparsity | Average<br>Performance | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval |
 | :-------------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: | :-----------------: |
-| LLaMA2-7B | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
-| ReluLLaMA-7B | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
-| **ProSparse-7B**\* | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
-| **ProSparse-7B** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
-| LLaMA2-13B | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
-| ReluLLaMA-13B | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
-| **ProSparse-13B**\* | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
-| **ProSparse-13B** | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
-| MiniCPM-1B | 36.85
-| **ProSparse-1B**\* | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
-| **ProSparse-1B** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
+| LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
+| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
+| **ProSparse-7B**\* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
+| **ProSparse-7B** | **89.32** | **38.46** | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
+| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
+| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
+| **ProSparse-13B**\* | 87.97 | **45.07** | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
+| **ProSparse-13B** | **88.80** | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
+| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
+| **ProSparse-1B**\* | 86.25 | **44.72** | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
+| **ProSparse-1B** | **87.89** | **44.72** | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
 
 **Notes**: "Original" refers to the original Swish-activated LLaMA2 versions. ReluLLaMA-7B and ReluLLaMA-13B are available at [7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [13B](https://huggingface.co/SparseLLM/ReluLLaMA-13B) respectively. MiniCPM-1B is available at [1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16). "ProSparse-7B\*", "ProSparse-13B\*", and "ProSparse-1B\*" denote the ProSparse versions without activation threshold shifting.
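For the perplexity-based benchmarks in the notes above, selection amounts to scoring each candidate answer with the model and keeping the most likely one (lowest average loss, hence best perplexity). A minimal sketch with the Hugging Face `transformers` API follows; the checkpoint, prompt format, and helper name are placeholders for illustration, not the authors' evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM from this card would work the same way.
path = "openbmb/MiniCPM-1B-sft-bf16"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()

def answer_loss(prompt: str, answer: str) -> float:
    """Average token loss of `answer` given `prompt` (lower = more likely).

    Tokenizing prompt and prompt+answer separately is an approximation at
    the boundary, which is fine for a sketch.
    """
    ids = tok(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prompt_len] = -100  # score only the answer tokens
    with torch.no_grad():
        return model(input_ids=ids, labels=labels).loss.item()

prompt = "Question: Which option is heavier, a feather or a brick?\nAnswer: "
choices = ["a feather", "a brick"]
prediction = min(choices, key=lambda c: answer_loss(prompt, c))
```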
@@ -114,7 +115,7 @@ where \\(\mathbf{s}\\), \\(\mathbf{x}\\), \\(\mathbf{x}_1\\), and \\(\odot\\) denote
 
 The acceleration effects of LLMs with different sparsity are displayed as follows. ProSparse, which reaches a high sparsity without performance degradation, gains the most benefit among all the settings considered. Refer to Section 4.3 of the [paper](https://arxiv.org/pdf/2402.13516.pdf) for more details.
 
-| Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time | Speedup<br>to Dense | `S3`<br/>Time | Speedup<br/>to Dense |
+| Setting | Average<br>Sparsity | Activation<br>Recall | Predicted<br>Sparsity | PowerInfer<br>Speed | Speedup<br>to Dense | `S2`<br>Time \\((\downarrow)\\) | Speedup<br>to Dense | `S3`<br/>Time \\((\downarrow)\\) | Speedup<br/>to Dense |
 | :-------------------: | :-----------------: | :------------------: | :-------------------: | :-----------------: | :-----------------: | :--------------: | :-----------------: | :---------------: | :------------------: |
 | Dense-7B | - | - | - | 3.67 | 1.00 | 90.55 | 1.00 | 82.92 | 1.00 |
 | ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 3.10 | 67.12 | 1.35 | 63.00 | 1.32 |
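The gating formula referenced in the hunk header and the "Average Sparsity" column can be illustrated together: in a ReLU-gated FFN the gating scores \\(\mathbf{s}\\) contain exact zeros, and sparsity is simply the fraction of zeros. Below is a self-contained sketch with random weights; the shapes and variable names are assumptions, not the model's actual modules.

```python
import torch

# ReLU-gated FFN in the style of the formula above:
#   s  = relu(x @ W_gate)     gating scores (many exact zeros)
#   x1 = x @ W_up             up projection
#   out = (s * x1) @ W_down   elementwise gate, then down projection
# Shapes are illustrative; real checkpoints define their own dimensions.
hidden, inter = 4096, 11008
W_gate = torch.randn(hidden, inter) / hidden**0.5
W_up = torch.randn(hidden, inter) / hidden**0.5
W_down = torch.randn(inter, hidden) / inter**0.5

x = torch.randn(8, hidden)       # a batch of token hidden states
s = torch.relu(x @ W_gate)       # zeroed entries can be skipped at inference
out = (s * (x @ W_up)) @ W_down

# "Average sparsity" as reported above: the fraction of zero gating scores.
# A positive threshold here would mimic activation threshold shifting.
threshold = 0.0
sparsity = (s <= threshold).float().mean().item()
print(f"average sparsity: {sparsity:.2%}")  # roughly 50% for random weights
```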
@@ -163,4 +164,4 @@ Therefore, when using content generated by MiniCPM, users should take full responsibility
 
 #### Acknowledgments
 
-The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16).
+The model card is modified from [ReluLLaMA-7B](https://huggingface.co/SparseLLM/ReluLLaMA-7B) and [MiniCPM-1B](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16).