---
library_name: transformers
license: apache-2.0
base_model:
- HuggingFaceTB/SmolLM2-360M-Instruct
tags:
- HuggingFaceTB
- SmolLM2
- SmolLM2-360M-Instruct
- Int8
- M5Stack
- RaspberryPi 5
language:
- en
---
# SmolLM2-360M-Instruct

This version of SmolLM2-360M-Instruct has been converted to run on the Axera NPU using **w8a16** quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 3.4 (not released yet)

## Conversion tool links

If you are interested in model conversion, you can export the axmodel yourself from the original repository:
https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct

- [Pulsar2: how to convert an LLM from Hugging Face to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
- [AXera NPU HOST LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/internvl2)
- [AXera NPU AXCL LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/axcl-llm-internvl)

## Supported platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock (AXera-Pi Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
  - [AXera-Pi 2](https://axera-pi-2-docs-cn.readthedocs.io/zh-cn/latest/index.html)
  - [Module-LLM](https://docs.m5stack.com/zh_CN/module/Module-LLM)
  - [LLM630 Compute Kit](https://docs.m5stack.com/zh_CN/core/LLM630%20Compute%20Kit)

|Chips|w8a16|w4a16|
|--|--|--|
|AX650|39 tokens/sec|todo|
|AX630C|14 tokens/sec|todo|
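
As a rough reading of the table, generation time grows linearly with the number of output tokens. The sketch below assumes the listed w8a16 figures are sustained decode rates and ignores prefill time (time to first token). The function and dictionary names are illustrative only and are not part of the runtime.

```python
# Decode rates in tokens/sec, copied from the w8a16 column of the table above.
DECODE_RATE = {"AX650": 39.0, "AX630C": 14.0}


def estimated_decode_seconds(chip: str, new_tokens: int) -> float:
    """Estimate the seconds spent generating `new_tokens`, ignoring prefill."""
    return new_tokens / DECODE_RATE[chip]


if __name__ == "__main__":
    for chip in DECODE_RATE:
        seconds = estimated_decode_seconds(chip, 256)
        print(f"{chip}: about {seconds:.1f} s for 256 tokens")
```

Real latency also depends on prompt length and sampling settings, so treat the result as an order-of-magnitude estimate.
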
## How to use
Download all files from this repository to the device. The resulting directory should look like this:
```
root@ax650:/mnt/qtang/llm-test/smollm2-360m# tree -L 1
.
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- main_prefill
|-- post_config.json
|-- run_smollm2_360m_ax630c.sh
|-- run_smollm2_360m_ax650.sh
|-- run_smollm2_360m_axcl_aarch64.sh
|-- run_smollm2_360m_axcl_x86.sh
|-- smollm2-360m-ax630c
|-- smollm2-360m-ax650
|-- smollm2_tokenizer
`-- smollm2_tokenizer.py
```
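
To confirm that the download is complete before running anything, a small check like the following can help. It is a minimal sketch that only compares a directory against the entries in the `tree -L 1` listing above; the helper name `missing_entries` is hypothetical and does not ship with this repository.

```python
import sys
from pathlib import Path
from typing import List

# Entries taken from the `tree -L 1` listing above.
EXPECTED_ENTRIES = [
    "main_axcl_aarch64",
    "main_axcl_x86",
    "main_prefill",
    "post_config.json",
    "run_smollm2_360m_ax630c.sh",
    "run_smollm2_360m_ax650.sh",
    "run_smollm2_360m_axcl_aarch64.sh",
    "run_smollm2_360m_axcl_x86.sh",
    "smollm2-360m-ax630c",
    "smollm2-360m-ax650",
    "smollm2_tokenizer",
    "smollm2_tokenizer.py",
]


def missing_entries(root: str) -> List[str]:
    """Return the expected files or directories that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (base / name).exists()]


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_entries(root)
    print("all entries present" if not missing else f"missing: {missing}")
```

Run it with the download directory as the only argument; with no argument it checks the current directory.
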
### Install transformers
```
pip install transformers==4.41.1
```
### Start the Tokenizer service
```
root@ax650:/mnt/qtang/llm-test/smollm2-360m$ python smollm2_tokenizer.py --port 12345
1 <|im_start|> 2 <|im_end|>
<|im_start|>system
You are a helpful AI assistant named SmolLM, trained by Hugging Face<|im_end|>
<|im_start|>user
hello world<|im_end|>
<|im_start|>assistant
[1, 9690, 198, 2683, 359, 253, 5356, 5646, 11173, 3365, 3511, 308, 34519, 28, 7018, 411, 407, 19712, 8182, 2, 198, 1, 4093, 198, 28120, 905, 2, 198, 1, 520, 9531, 198]
http://localhost:12345
```
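
The prompt printed by the tokenizer service follows the ChatML-style layout used by SmolLM2: each turn is wrapped in `<|im_start|>` and `<|im_end|>`, and the prompt ends with an open assistant turn. The sketch below rebuilds that string in plain Python to make the layout explicit. It is an illustration inferred from the log above, not the code inside `smollm2_tokenizer.py`, and the function name is hypothetical.

```python
from typing import Dict, List


def build_chatml_prompt(messages: List[Dict[str, str]],
                        add_generation_prompt: bool = True) -> str:
    """Render messages as <|im_start|>role\\ncontent<|im_end|>\\n blocks."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = build_chatml_prompt([
        {"role": "system",
         "content": "You are a helpful AI assistant named SmolLM, "
                    "trained by Hugging Face"},
        {"role": "user", "content": "hello world"},
    ])
    print(prompt)
```

In the token list from the log, id `1` is `<|im_start|>` and id `2` is `<|im_end|>`, which matches the `bos_id` and `eos_id` reported by the runtime.
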
### Inference with an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or AX650N DEMO Board
Open another terminal and run `run_smollm2_360m_ax650.sh`.
```
root@ax650:/mnt/qtang/llm-test/smollm2-360m# ./run_smollm2_360m_ax650.sh
[I][ Init][ 125]: LLM init start
bos_id: 1, eos_id: 2
2% | █ | 1 / 35 [0.00s<0.14s, 250.00 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 35 / 35 [0.81s<0.81s, 43.37 count/s] init post axmodel ok,remain_cmm(3339 MB)
[I][ Init][ 241]: max_token_len : 1023
[I][ Init][ 246]: kv_cache_size : 320, kv_cache_num: 1023
[I][ Init][ 254]: prefill_token_num : 128
[I][ load_config][ 281]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 268]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you?
[I][ Run][ 466]: ttft: 156.63 ms
I'm a chatbot developed by the Artificial Intelligence Research and Development Lab (AI R&D Lab) at Hugging Face Labs,
specifically designed to facilitate and augment human-AI conversations. My role is to provide assistance in understanding
and responding to natural language queries, using advanced language models and AI algorithms to understand context and intent.
[N][ Run][ 605]: hit eos,avg 38.70 token/s
>> q
```
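
The config printed in the log enables temperature (`0.9`) and top-k sampling (`top_k: 10`), with top-p and repetition penalty disabled. The sketch below shows what those two settings mean in general terms: keep the `top_k` highest logits, divide them by the temperature, and sample from the resulting softmax. It illustrates the technique only and is not the runtime's actual implementation; `sample_top_k` is a hypothetical name.

```python
import math
import random
from typing import List, Optional


def sample_top_k(logits: List[float],
                 temperature: float = 0.9,
                 top_k: int = 10,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token id from the `top_k` highest logits after temperature scaling."""
    rng = rng or random.Random()
    # Indices of the top_k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature scaling, then a numerically stable softmax numerator.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 5.0, -2.0, 3.0, 1.5]
    print(sample_top_k(toy_logits, temperature=0.9, top_k=3))
```

A lower temperature sharpens the distribution toward the highest logit, and `top_k=1` reduces to greedy decoding.
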
### Inference with M.2 Accelerator card
[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.
```
(base) axera@raspberrypi:~/samples/smollm2-360m $ ./run_smollm2_360m_axcl_aarch64.sh
build time: Feb 13 2025 15:44:57
[I][ Init][ 111]: LLM init start
bos_id: 1, eos_id: 2
100% | ████████████████████████████████ | 35 / 35 [18.07s<18.07s, 1.94 count/s] init post axmodel okremain_cmm(6621 MB)
[I][ Init][ 226]: max_token_len : 1023
[I][ Init][ 231]: kv_cache_size : 320, kv_cache_num: 1023
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 288]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you?
I'm a virtual AI assistant, designed to support users with their questions and tasks.
I was trained on a vast dataset of text, including text from various sources and
conversations. This extensive training allows me to understand and respond to a wide range of queries.
I'm here to be helpful and provide answers to your questions.
[N][ Run][ 610]: hit eos,avg 20.81 token/s
>> ^Cq
(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI V2.26.0_20250205130139 Driver V2.26.0_20250205130139 |
+-----------------------------------------+--------------+---------------------------------------+
| Card Name Firmware | Bus-Id | Memory-Usage |
| Fan Temp Pwr:Usage/Cap | CPU NPU | CMM-Usage |
|=========================================+==============+=======================================|
| 0 AX650N V2.26.0 | 0000:01:00.0 | 171 MiB / 945 MiB |
| -- 39C -- / -- | 2% 0% | 468 MiB / 7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+
+------------------------------------------------------------------------------------------------+
| Processes: |
| Card PID Process Name NPU Memory Usage |
|================================================================================================|
| 0 18636 /home/axera/qtang/llm-test/smollm2-360m/main_axcl_aarch64 418580 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $
```