Qwen2.5-3B-Instruct
This version of Qwen2.5-3B-Instruct has been converted to run on the Axera NPU using w8a16 and w4a16 quantization.
Compatible with Pulsar2 version: 4.1
If you are interested in model conversion, you can export the axmodel yourself from the original repo. See:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU AXEngine LLM Runtime
The following shows how to convert Qwen2.5-3B-Instruct-GPTQ-Int8:
pulsar2 llm_build --input_path Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8 \
--output_path Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8-ctx-ax650 \
--hidden_state_type bf16 --kv_cache_len 2047 --prefill_len 128 \
--last_kv_cache_len 128 \
--last_kv_cache_len 256 \
--last_kv_cache_len 384 \
--last_kv_cache_len 512 \
--last_kv_cache_len 640 \
--last_kv_cache_len 768 \
--last_kv_cache_len 896 \
--last_kv_cache_len 1024 \
--chip AX650 -c 1 --parallel 8
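The repeated --last_kv_cache_len flags are not a typo: each occurrence registers one prefill-length group (they reappear in the runtime init log below as grp 2 through grp 9). If you need to script the export, a small wrapper along these lines can generate the flag list; the wrapper itself is illustrative and not part of the Pulsar2 toolchain:

```python
# Illustrative wrapper (not part of Pulsar2): builds the llm_build
# command above, emitting one --last_kv_cache_len flag per prefill
# group (128, 256, ..., 1024).
import subprocess

cmd = ["pulsar2", "llm_build",
       "--input_path", "Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8",
       "--output_path", "Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8-ctx-ax650",
       "--hidden_state_type", "bf16",
       "--kv_cache_len", "2047",
       "--prefill_len", "128"]
for n in range(128, 1025, 128):        # 128, 256, ..., 1024
    cmd += ["--last_kv_cache_len", str(n)]
cmd += ["--chip", "AX650", "-c", "1", "--parallel", "8"]

subprocess.run(cmd, check=True)        # requires pulsar2 on PATH
```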
| Chips | w8a16 | w4a16 | DDR(w8) | Flash(w8) | DDR(w4) | Flash(w4) |
|---|---|---|---|---|---|---|
| AX650 | | | 2.8GB | 2.9GB | | |
Download all files from this repository to the device
(base) axera@raspberrypi:~/samples/AXERA-TECH/Qwen2.5-3B-Instruct $ tree -L 1
.
├── config.json
├── main_api_ax650
├── main_api_axcl_aarch64
├── main_api_axcl_x86
├── main_ax650
├── main_axcl_aarch64
├── main_axcl_x86
├── post_config.json
├── qwen2.5-3b-ctx-int4-ax650
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer_uid.py
├── README.md
├── run_qwen2.5_3b_ctx_ax650.sh
├── run_qwen2.5_3b_ctx_axcl_aarch64.sh
├── run_qwen2.5_3b_ctx_axcl_x86.sh
├── run_qwen2.5_3b_ctx_int4_ax650.sh
├── run_qwen2.5_3b_ctx_int4_axcl_aarch64.sh
└── run_qwen2.5_3b_ctx_int4_axcl_x86.sh
3 directories, 16 files
(py312) axera@raspberrypi:~/samples/AXERA-TECH/Qwen2.5-3B-Instruct $ python qwen2.5_tokenizer_uid.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
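The main_* binaries expect this tokenizer service to be reachable before they start (see --url_tokenizer_model below). The request routes are internal to the runtime, so a quick check can only probe that the port is open; host and port are taken from the banner above:

```python
# Minimal reachability probe for the tokenizer service; it only checks
# that TCP port 12345 is accepting connections, not the HTTP routes.
import socket

def is_listening(host: str = "127.0.0.1", port: int = 12345) -> bool:
    try:
        with socket.create_connection((host, port), timeout=2.0):
            return True
    except OSError:
        return False

print("tokenizer server up:", is_listening())
```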
The runtime supports two optional flags: --system_prompt, which sets a custom system prompt, and --kvcache_path, which points to a directory used to persist the precomputed KV cache. Create the directory first:

mkdir kvcache
./main_axcl_aarch64 \
--template_filename_axmodel "qwen2.5-3b-ctx-int4-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 36 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "qwen2.5-3b-ctx-int4-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "qwen2.5-3b-ctx-int4-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 151936 \
--tokens_embed_size 2048 \
--use_mmap_load_embed 1 \
--live_print 1 \
--devices 0
#--system_prompt "Your name is Xiaozhi (allen), and you are a harmless AI assistant. Shenzhen is overcast today (April 1st, April Fools' Day), with temperatures between 14°C and 19°C and a light breeze." \
#--kvcache_path "./kvcache" \
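--tokens_embed_num and --tokens_embed_size correspond to Qwen2.5's vocabulary size (151936) and the model's 2048-wide embedding table, stored as bfloat16 (2 bytes per value). If the runtime fails to load the embedding file, a quick size check can rule out a truncated download; this is a sketch of such a check, not a script shipped in the repository:

```python
# Sanity check (not part of this repo): the embedding blob should be
# exactly vocab_size x embed_width x 2 bytes (bfloat16).
import os

VOCAB = 151936   # --tokens_embed_num
WIDTH = 2048     # --tokens_embed_size
PATH = "qwen2.5-3b-ctx-int4-ax650/model.embed_tokens.weight.bfloat16.bin"

expected = VOCAB * WIDTH * 2
actual = os.path.getsize(PATH)
print(f"expected {expected} bytes, found {actual}, match={actual == expected}")
```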
Open another terminal and run run_qwen2.5_3b_ctx_ax650.sh
TODO
What is an M.2 Accelerator card? The demo below runs on a Raspberry Pi 5 fitted with an M.2 accelerator card.
(base) axera@raspberrypi:~/samples/AXERA-TECH/Qwen2.5-3B-Instruct $ ./run_qwen2.5_3b_ctx_int4_axcl_aarch64.sh
[I][ Init][ 130]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: ec8f8194-c12f-41fa-b1a0-1ae232c9f15a
bos_id: -1, eos_id: 151645
2% | █ | 1 / 39 [0.55s<21.33s, 1.83 count/s] tokenizer init ok
[I][ Init][ 45]: LLaMaEmbedSelector use mmap
5% | ██ | 2 / 39 [0.55s<10.67s, 3.66 count/s] embed_selector init ok
[I][ run][ 30]: AXCLWorker start with devid 0
76% | ███████████████████████████ | 29 / 39 [41.68s<58.05s, 0.67 count/s] init 6 axmodel ok, devid(0) remain_cmm(-1 MB)
100% | ████████████████████████████████ | 39 / 39 [61.29s<64.60s, 0.60 count/s] init post axmodel ok, remain_cmm(3981 MB)
[I][ Init][ 221]: max_token_len : 2047
[I][ Init][ 224]: kv_cache_size : 256, kv_cache_num: 2047
[I][ Init][ 232]: prefill_token_num : 128
[I][ Init][ 236]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 236]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 236]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 236]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 236]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 236]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 236]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 236]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 236]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 240]: prefill_max_token_num : 1024
________________________
| ID| remain cmm(MB)|
========================
| 0| 3981|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 263]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 324]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 367]: input_num_token:21
[I][ main][ 234]: precompute_len: 21
[I][ main][ 235]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> who are you
[I][ SetKVCache][ 614]: prefill_grpid:2 kv_cache_num:128 precompute_len:21 input_num_token:11
[I][ SetKVCache][ 617]: current prefill_max_token_num:896
[I][ Run][ 855]: input token num : 11, prefill_split_num : 1
[I][ Run][ 887]: input_num_token:11
[I][ Run][1016]: ttft: 596.11 ms
I am Qwen, created by Alibaba Cloud. I am here to assist with a wide range of tasks and answer a variety of questions. How can I assist you today?
[N][ Run][1168]: hit eos,avg 7.72 token/s
[I][ GetKVCache][ 583]: precompute_len:67, remaining:957
prompt >> q
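Two details of the log above are worth unpacking. First, the grp lines: each group caps how many prompt tokens one prefill pass can cover, and the runtime appears to pick the smallest group that fits the tokens to be prefilled (the 21-token system prompt lands in grp 2, capacity 128). The selection logic below is an assumption inferred from the log, not the actual AXEngine source:

```python
# Assumed group-selection logic, inferred from the init log
# (grp 1..9 with capacities 1, 128, 256, ..., 1024); the real
# AXEngine implementation may differ.
CAPACITIES = [1, 128, 256, 384, 512, 640, 768, 896, 1024]

def pick_group(num_tokens: int) -> tuple[int, int]:
    for grp_id, cap in enumerate(CAPACITIES, start=1):
        if num_tokens <= cap:
            return grp_id, cap
    raise ValueError("prompt exceeds prefill_max_token_num (1024)")

print(pick_group(21))   # -> (2, 128), matching prefill_grpid:2 above
```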
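Second, the load_config block: post_config.json enables temperature (0.9) and top-k sampling (k=10) while leaving top-p and the repetition penalty off. Below is a minimal sketch of that sampling path, using the standard temperature-then-top-k formulation rather than AXEngine's actual code:

```python
# Standard temperature + top-k sampling, matching the enabled options
# in post_config.json; a textbook sketch, not AXEngine's implementation.
import math, random

def sample(logits, temperature=0.9, top_k=10):
    scaled = [x / temperature for x in logits]
    # keep only the top_k highest-scoring token ids
    top = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)[:top_k]
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]   # stable softmax
    return random.choices(top, weights=weights, k=1)[0]

print(sample([2.0, 1.5, 0.3, -1.0, 0.9]))
```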