---
library_name: transformers
license: apache-2.0
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
tags:
- HuggingFaceTB
- SmolLM2
- SmolLM2-360M-Instruct
- Int8
- M5Stack
- RaspberryPi 5
language:
- en
---

# SmolLM2-360M-Instruct

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/oWWfzW4RbWkVIo7f-5444.png)

This version of SmolLM2-360M-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.
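
Here, **w8a16** means the weights are quantized to 8-bit integers while activations remain in 16-bit floating point. A minimal sketch of the idea (illustrative only, not Pulsar2's actual quantizer):

```python
import numpy as np

def quantize_w8(w: np.ndarray):
    # Per-output-channel symmetric int8 quantization: the "w8" half of w8a16.
    # Activations are left in 16-bit float, so only the weights shrink.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_w8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # At inference the int8 weights are rescaled back before (or fused into)
    # the 16-bit matmul; this float reference shows the numerical effect.
    return q.astype(np.float16) * scale.astype(np.float16)
```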

Compatible with Pulsar2 version: 3.4 (not yet released)

## Conversion tool links

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:
https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) 

[AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/internvl2) 

[AXera NPU AXCL LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/axcl-llm-internvl)

## Supported Platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
  - [爱芯派2](https://axera-pi-2-docs-cn.readthedocs.io/zh-cn/latest/index.html)
  - [Module-LLM](https://docs.m5stack.com/zh_CN/module/Module-LLM)
  - [LLM630 Compute Kit](https://docs.m5stack.com/zh_CN/core/LLM630%20Compute%20Kit)
 
|Chip|w8a16|w4a16|
|--|--|--|
|AX650|39 tokens/sec|TODO|
|AX630C|14 tokens/sec|TODO|

## How to use

Download all the files in this repository to the device.
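
One convenient way is the `huggingface_hub` Python API; a minimal sketch, where `<this-repo-id>` is a placeholder for this repository's actual id:

```python
from huggingface_hub import snapshot_download

# <this-repo-id> is a placeholder -- substitute this model repository's id.
# snapshot_download fetches every file in the repo into local_dir.
snapshot_download(repo_id="<this-repo-id>", local_dir="smollm2-360m")
```

After downloading, the directory should look like this: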

```
root@ax650:/mnt/qtang/llm-test/smollm2-360m# tree -L 1
.
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- main_prefill
|-- post_config.json
|-- run_smollm2_360m_ax630c.sh
|-- run_smollm2_360m_ax650.sh
|-- run_smollm2_360m_axcl_aarch64.sh
|-- run_smollm2_360m_axcl_x86.sh
|-- smollm2-360m-ax630c
|-- smollm2-360m-ax650
|-- smollm2_tokenizer
`-- smollm2_tokenizer.py
```

### Install transformers

```
pip install transformers==4.41.1
```

### Start the Tokenizer service

```
root@ax650:/mnt/qtang/llm-test/smollm2-360m$ python smollm2_tokenizer.py --port 12345
1 <|im_start|> 2 <|im_end|>
<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant

[1, 9690, 198, 2683, 359, 253, 5356, 5646, 11173, 3365, 3511, 308, 34519, 28, 7018, 411, 407, 19712, 8182, 2, 198, 1, 4093, 198, 28120, 905, 2, 198, 1, 520, 9531, 198]
http://localhost:12345
```
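
The prompt text and token ids printed above come from the model's chat template. A minimal sketch of reproducing them with `transformers` (assuming the `smollm2_tokenizer` directory contains the Hugging Face tokenizer files):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./smollm2_tokenizer")
messages = [
    {"role": "system",
     "content": "You are a helpful AI assistant named SmolLM, trained by Hugging Face"},
    {"role": "user", "content": "hello world"},
]
# add_generation_prompt appends the trailing "<|im_start|>assistant" turn
ids = tok.apply_chat_template(messages, add_generation_prompt=True)
print(ids)  # should match the id list printed by smollm2_tokenizer.py
```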

### Inference with an AX650 host, such as the M4N-Dock (爱芯派Pro) or AX650N DEMO Board

Open another terminal and run `run_smollm2_360m_ax650.sh`.

```
root@ax650:/mnt/qtang/llm-test/smollm2-360m# ./run_smollm2_360m_ax650.sh
[I][                            Init][ 125]: LLM init start
bos_id: 1, eos_id: 2
  2% | β–ˆ                                 |   1 /  35 [0.00s<0.14s, 250.00 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ |  35 /  35 [0.81s<0.81s, 43.37 count/s] init post axmodel ok,remain_cmm(3339 MB)
[I][                            Init][ 241]: max_token_len : 1023
[I][                            Init][ 246]: kv_cache_size : 320, kv_cache_num: 1023
[I][                            Init][ 254]: prefill_token_num : 128
[I][                     load_config][ 281]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you?
[I][                             Run][ 466]: ttft: 156.63 ms
I'm a chatbot developed by the Artificial Intelligence Research and Development Lab (AI R&D Lab) at Hugging Face Labs,
specifically designed to facilitate and augment human-AI conversations. My role is to provide assistance in understanding
 and responding to natural language queries, using advanced language models and AI algorithms to understand context and intent.

[N][                             Run][ 605]: hit eos,avg 38.70 token/s

>> q

```
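
The `post_config.json` shown in the log controls decoding: with these settings each step applies temperature scaling, then samples from the 10 most likely tokens (top-p and the repetition penalty are disabled). A minimal illustrative sketch, not the runtime's actual implementation:

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 0.9, top_k: int = 10) -> int:
    # Temperature < 1 sharpens the distribution; top-k then restricts
    # sampling to the k highest-logit tokens, mirroring
    # enable_temperature=true and enable_top_k_sampling=true above.
    scaled = logits / temperature
    top = np.argsort(scaled)[-top_k:]                # indices of the k largest logits
    probs = np.exp(scaled[top] - scaled[top].max())  # stable softmax over the top-k
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))
```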

### Inference with M.2 Accelerator card

[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.

```
(base) axera@raspberrypi:~/samples/smollm2-360m $ ./run_smollm2_360m_axcl_aarch64.sh
build time: Feb 13 2025 15:44:57
[I][                            Init][ 111]: LLM init start
bos_id: 1, eos_id: 2
100% | β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ |  35 /  35 [18.07s<18.07s, 1.94 count/s] init post axmodel okremain_cmm(6621 MB)
[I][                            Init][ 226]: max_token_len : 1023
[I][                            Init][ 231]: kv_cache_size : 320, kv_cache_num: 1023
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you?

I'm a virtual AI assistant, designed to support users with their questions and tasks.
I was trained on a vast dataset of text, including text from various sources and
conversations. This extensive training allows me to understand and respond to a wide range of queries.
I'm here to be helpful and provide answers to your questions.

[N][                             Run][ 610]: hit eos,avg 20.81 token/s


>> ^Cq

(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI  V2.26.0_20250205130139                                Driver  V2.26.0_20250205130139 |
+-----------------------------------------+--------------+---------------------------------------+
| Card  Name                     Firmware | Bus-Id       |                          Memory-Usage |
| Fan   Temp                Pwr:Usage/Cap | CPU      NPU |                             CMM-Usage |
|=========================================+==============+=======================================|
|    0  AX650N                    V2.26.0 | 0000:01:00.0 |                171 MiB /      945 MiB |
|   --   39C                      -- / -- | 2%        0% |                468 MiB /     7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+

+------------------------------------------------------------------------------------------------+
| Processes:                                                                                     |
| Card      PID  Process Name                                                   NPU Memory Usage |
|================================================================================================|
|    0    18636  /home/axera/qtang/llm-test/smollm2-360m/main_axcl_aarch64            418580 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $
```