Text Generation · Safetensors · English · Chinese · plm · conversational · custom_code
daven3 committed on
Commit 62d188c · verified · 1 Parent(s): d32df61

Update README.md

Files changed (1): README.md (+107 -32)
README.md CHANGED
@@ -1,13 +1,15 @@
  ---
- base_model: PLM-Team/PLM-1.8B-Instruct
  language:
  - en
  - zh
- library_name: transformers
- license: apache-2.0
- quantized_by: PLM-Team
  pipeline_tag: text-generation
  ---
  <center>
  <img src="https://www.cdeng.net/plm/plm_logo.png" alt="plm-logo" width="200"/>
  <h2>🖲️ PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing</h2>
@@ -26,43 +28,102 @@ pipeline_tag: text-generation

  The PLM (Peripheral Language Model) series introduces a novel model architecture to peripheral computing by delivering powerful language capabilities within the constraints of resource-limited devices. Through a modeling and system co-design strategy, PLM optimizes model performance while fitting edge system requirements. PLM employs **Multi-head Latent Attention** and **squared ReLU** activation to achieve sparsity, significantly reducing memory footprint and computational demands. Coupled with a meticulously crafted training regimen using curated datasets and a Warmup-Stable-Decay-Constant learning rate scheduler, PLM demonstrates superior performance compared to existing small language models, all while maintaining the lowest number of activated parameters, making it ideally suited for deployment on diverse peripheral platforms like mobile phones and Raspberry Pis.

- **Here we present the static quants of https://huggingface.co/PLM-Team/PLM-1.8B-Instruct**

- ## Provided Quants

- | Link | Type | Size/GB | Notes |
- |:-----|:-----|--------:|:------|
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-F16.gguf|F16| 3.66 GB| Recommended|
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q2_K.gguf|Q2_K| 827 MB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q3_K_L.gguf|Q3_K_L| 1.09 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q3_K_M.gguf|Q3_K_M| 1.01 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q3_K_S.gguf|Q3_K_S| 912 MB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q4_0.gguf|Q4_0| 1.11 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q4_1.gguf|Q4_1| 1.21 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q4_K_M.gguf|Q4_K_M| 1.18 GB| Recommended|
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q4_K_S.gguf|Q4_K_S| 1.12 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q5_0.gguf|Q5_0| 1.3 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q5_1.gguf|Q5_1| 1.4 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q5_K_M.gguf|Q5_K_M| 1.34 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q5_K_S.gguf|Q5_K_S| 1.3 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q6_K.gguf|Q6_K| 1.5 GB| |
- |https://huggingface.co/PLM-Team/PLM-1.8B-Instruct-gguf/blob/main/PLM-1.8B-Instruct-Q8_0.gguf|Q8_0| 1.95 GB| Recommended|

- ## Usage (llama.cpp)

- Now [llama.cpp](https://github.com/ggml-org/llama.cpp) supports our model. Here is the usage:

- ```bash
- git clone https://github.com/Si1w/llama.cpp.git
- cd llama.cpp
  ```

- If you want to convert the original model into `gguf` format by yourself, you can:

  ```bash
  pip install -r requirements.txt
- python convert_hf_to_gguf.py [model] --outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}
  ```

  Then, we can build with CPU or GPU (e.g. Orin). The build is based on `cmake`.
@@ -93,9 +154,23 @@ After building `llama.cpp`, we can use the `llama-cli` script to launch PLM.
  ./build/bin/llama-cli -m ./PLM-Team/PLM-1.8B-Instruct-gguf/PLM-1.8B-Instruct-Q8_0.gguf -cnv -p "hello!" -n 128
  ```

- ## Citation

- If you find Project PLM helpful for your research or applications, please cite as follows:

  ```
  @misc{deng2025plmefficientperipherallanguage,
 
  ---
+ license: mit
+ datasets:
+ - HuggingFaceFW/fineweb-edu
+ - mlfoundations/dclm-baseline-1.0
+ - BAAI/CCI3-HQ
  language:
  - en
  - zh
  pipeline_tag: text-generation
  ---
+
  <center>
  <img src="https://www.cdeng.net/plm/plm_logo.png" alt="plm-logo" width="200"/>
  <h2>🖲️ PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing</h2>
 

  The PLM (Peripheral Language Model) series introduces a novel model architecture to peripheral computing by delivering powerful language capabilities within the constraints of resource-limited devices. Through a modeling and system co-design strategy, PLM optimizes model performance while fitting edge system requirements. PLM employs **Multi-head Latent Attention** and **squared ReLU** activation to achieve sparsity, significantly reducing memory footprint and computational demands. Coupled with a meticulously crafted training regimen using curated datasets and a Warmup-Stable-Decay-Constant learning rate scheduler, PLM demonstrates superior performance compared to existing small language models, all while maintaining the lowest number of activated parameters, making it ideally suited for deployment on diverse peripheral platforms like mobile phones and Raspberry Pis.
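
To make the sparsity claim concrete, here is a minimal, illustrative sketch of a feed-forward block with squared-ReLU activation. This is not PLM's actual implementation; the layer sizes below are placeholder assumptions. The point is that ReLU zeroes out negative pre-activations and squaring keeps them at exactly zero, so much of the hidden state contributes nothing to the down-projection, which is where the memory and compute savings come from.

```python
import torch
import torch.nn as nn

class SquaredReLUFFN(nn.Module):
    """Illustrative feed-forward block with squared-ReLU activation (sizes are placeholders)."""
    def __init__(self, d_model: int = 2048, d_ff: int = 8192):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x)) ** 2  # squared ReLU: negative pre-activations stay exactly zero
        return self.down(h)

ffn = SquaredReLUFFN()
x = torch.randn(1, 8, 2048)
y = ffn(x)
# Fraction of hidden units that are exactly zero for this input
sparsity = (torch.relu(ffn.up(x)) == 0).float().mean().item()
print(y.shape, f"hidden-state sparsity: {sparsity:.2f}")
```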

+ ---
+ ## News

+ > The paper **"PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing"** has been released!

+ ## PLM Roadmap

+ <center>
+ <img src="https://www.cdeng.net/plm/pipe.png" width="100%"/>
+ </center>

+ ## PLM Highlights

+ PLM demonstrates highly competitive performance along with a series of advantages stemming from its modeling and system co-design. These benefits include impressive inference speed, extreme sparsity, and a reduced KV cache thanks to MLA, allowing it to outperform models with the same number of layers on long-context inference tasks at certain sequence lengths.

+
+ - **Sparse** (fewer activated parameters but better performance)
+
+ <div align="center">
+ <img src="https://www.cdeng.net/plm/sparse_compare.png" width="50%"/>
+ </div>
+
+ - **High efficiency** (generates content with low latency while maintaining good quality)
+
+ <center>
+ <img src="https://www.cdeng.net/plm/latency/latency_all.png" width="100%"/>
+ </center>
+
+ - **Low KV cache**: reduced KV-cache usage in long-context processing leads to low latency when running inference on long sequences.
+
+ |||
+ |:-:|:-:|
+ |<img src="https://www.cdeng.net/plm/latency/prefill_eff.png"/>|<img src="https://www.cdeng.net/plm/latency/decode_eff.png"/>|
+
+ - **Higher efficiency** with layer-wise loading.
+
+ |||
+ |:-:|:-:|
+ |<img src="https://www.cdeng.net/plm/latency/prefill_ngl.png"/>|<img src="https://www.cdeng.net/plm/latency/decode_ngl.png"/>|
+
+ ## Performance
+
+ PLM-1.8B is a strong and reliable model, particularly in basic knowledge understanding, coding and simple reasoning tasks.
+
+ <center>
+
+ | **Benchmarks** | PLM-Instruct | MiniCPM | Yulan-Mini | SmolLM2 | Qwen2.5 | Qwen2 | GLM-Edge |
+ |:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+ | **ARC-C** | <u>51.14</u> | 43.86 | 50.51 | 50.29 | **53.41** | 43.90 | 24.15 |
+ | **ARC-E** | <u>78.18</u> | 55.51 | 69.87 | 77.78 | **79.13** | 62.21 | 36.83 |
+ | **MMLU** | 51.18 | 51.13 | 49.10 | 51.91 | **59.79** | <u>56.50</u> | 54.84 |
+ | **CMMLU** | 48.18 | 48.97 | 48.35 | 33.46 | <u>67.82</u> | **70.30** | 54.23 |
+ | **C-Eval** | 44.93 | 48.24 | 51.47 | 35.10 | <u>69.05</u> | **70.60** | 55.05 |
+ | **GSM8K** | 60.73 | 53.83 | <u>66.65</u> | 47.68 | **68.50** | 46.90 | 54.89 |
+ | **MathQA** | 33.23 | 30.59 | <u>34.84</u> | 34.30 | **35.14** | 31.66 | 33.94 |
+ | **HumanEval** | **64.60** | 50.00 | <u>61.60</u> | 23.35 | 37.20 | 34.80 | 1.21 |
+ | **MBPP** | <u>60.40</u> | 47.31 | **66.70** | 45.00 | 60.20 | 46.90 | 3.44 |
+ | **BoolQ** | <u>77.86</u> | 73.55 | 70.89 | 72.26 | 72.91 | 72.69 | 60.95 |
+ | **Hellaswag** | 68.17 | 53.06 | <u>71.47</u> | **71.48** | 67.73 | 65.41 | 29.39 |
+ | **LogiQA** | 30.12 | **31.64** | 29.65 | 29.65 | <u>31.03</u> | 31.02 | 22.73 |
+ | **PIQA** | 76.01 | 77.04 | 76.50 | 77.04 | **76.01** | <u>75.35</u> | 74.32 |
+ | **Average** | **57.29 (3rd)** | 51.13 | **57.51 (2nd)** | 49.95 | **59.84 (1st)** | 54.48 | 38.92 |
+
+ </center>
+
+ ## How to use PLM
+
+ Here we introduce some methods to use PLM models.
+
+ ### Hugging Face
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("PLM-Team/PLM-1.8B-Instruct")
+ model = AutoModelForCausalLM.from_pretrained("PLM-Team/PLM-1.8B-Instruct", torch_dtype=torch.bfloat16)
+
+ # Input text
+ input_text = "Tell me something about reinforcement learning."
+ inputs = tokenizer(input_text, return_tensors="pt")
+
+ # Completion
+ output = model.generate(inputs["input_ids"], max_new_tokens=100)
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
  ```
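
Since PLM-1.8B-Instruct is an instruction-tuned, conversational model, chat-style prompting through the tokenizer's chat template will usually give better results than raw completion. The sketch below is a variant of the example above under two assumptions: that the repo ships a chat template, and that loading may require `trust_remote_code=True` because the model page is tagged `custom_code`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True is an assumption based on the repo's custom_code tag
tokenizer = AutoTokenizer.from_pretrained("PLM-Team/PLM-1.8B-Instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "PLM-Team/PLM-1.8B-Instruct", torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Format the conversation with the tokenizer's chat template (assumed to be present)
messages = [{"role": "user", "content": "Tell me something about reinforcement learning."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```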

+ ### llama.cpp
+
+ The original contribution to the llama.cpp framework is [Si1w/llama.cpp](https://github.com/Si1w/llama.cpp). Here is the usage:

  ```bash
+ git clone https://github.com/Si1w/llama.cpp.git
+ cd llama.cpp
  pip install -r requirements.txt
  ```

  Then, we can build with CPU or GPU (e.g. Orin). The build is based on `cmake`.
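
The exact build commands sit in a part of the README that this hunk does not show. As a rough sketch only, a typical llama.cpp CMake build looks like the following; the CUDA flag for a GPU board such as Orin is an assumption and its name depends on the llama.cpp version in the fork.

```bash
# Typical llama.cpp CMake flow (CPU build); consult the repository's build docs for exact flags
cmake -B build
cmake --build build --config Release -j

# For an NVIDIA GPU (e.g. Jetson Orin), CUDA support is usually enabled at configure time,
# e.g. `cmake -B build -DGGML_CUDA=ON` on recent trees (older trees use -DLLAMA_CUDA=ON)
```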
 
  ./build/bin/llama-cli -m ./PLM-Team/PLM-1.8B-Instruct-gguf/PLM-1.8B-Instruct-Q8_0.gguf -cnv -p "hello!" -n 128
  ```
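
Beyond the interactive CLI, the same build normally also produces `llama-server`, which exposes an HTTP (OpenAI-compatible) endpoint. This invocation is a sketch and assumes the fork includes the standard server target and that the Q8_0 gguf from above is used.

```bash
# Serve the quantized model over HTTP on port 8080
./build/bin/llama-server -m ./PLM-Team/PLM-1.8B-Instruct-gguf/PLM-1.8B-Instruct-Q8_0.gguf --port 8080
```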

+ ## Future work
+
+ - [ ] Release vLLM, SGLang, and PowerInfer inference scripts for PLM.
+ - [ ] Release a reasoning model trained on PLM.
+ - [ ] Release a vision model based on PLM.

+ ## Acknowledgements
+
+ We sincerely thank DeepSeek for its contributions to the community through the MLA architecture, and the PowerInfer project for inspiring our model architecture design. We are grateful to Yixin Song, Yan Song, and Yang Li for their insightful suggestions throughout the project. We also acknowledge the ADC of the Hong Kong University of Science and Technology (Guangzhou) for providing essential computing resources. Finally, we extend our deepest appreciation to our team members for their dedication and contributions from September 2024 to the present.
+
+ ## License
+ The code in this repository is released under the MIT License.
+ Limitations: While we strive to address safety concerns and promote the generation of ethical and lawful text, the probabilistic nature of language models may still produce unforeseen outputs. These may include biased, discriminatory, or otherwise harmful content. Users are advised not to disseminate such material. We disclaim any liability for consequences resulting from the distribution of harmful information.
+
+ ## Citation
+ If you find **Project PLM** helpful for your research or applications, please cite as follows:

  ```
  @misc{deng2025plmefficientperipherallanguage,