Qwen2.5-3B-Instruct

This version of Qwen2.5-3B-Instruct has been converted to run on the Axera NPU using w8a16 and w4a16 quantization.

Compatible with Pulsar2 version: 4.1

Features

  • Supports longer contexts; in this sample it is 2k
  • Supports multi-turn dialogue
  • Supports caching the system prompt as KV cache

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo.

Pulsar2 Link: How to Convert LLM from Huggingface to axmodel

AXera NPU AXEngine LLM Runtime

AXera NPU AXCL LLM Runtime

Convert script

The following shows how to convert Qwen2.5-3B-Instruct-GPTQ-Int8. Note that --last_kv_cache_len is passed repeatedly; each value appears to define one prefill group (compare the grp 2–9 entries, prefill_max_token_num 128 through 1024, in the runtime log below).

pulsar2 llm_build --input_path Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8  \
                  --output_path Qwen/Qwen2.5-3B-Instruct-GPTQ-Int8-ctx-ax650 \
                  --hidden_state_type bf16 --kv_cache_len 2047 --prefill_len 128 \
                  --last_kv_cache_len 128 \
                  --last_kv_cache_len 256 \
                  --last_kv_cache_len 384 \
                  --last_kv_cache_len 512 \
                  --last_kv_cache_len 640 \
                  --last_kv_cache_len 768 \
                  --last_kv_cache_len 896 \
                  --last_kv_cache_len 1024 \
                  --chip AX650 -c 1 --parallel 8
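
As a quick illustration of how these flags relate to runtime behavior, the sketch below reproduces the prefill-group table that the runtime prints at init (see the log later in this README). The grouping rule is my own inference from that log, not something documented for Pulsar2:

# Illustrative only: reconstruct the prefill-group table printed by the
# runtime. The rule (grp 1 is a fixed 1-token group; each repeated
# --last_kv_cache_len value adds one group) is inferred from the log.
last_kv_cache_lens = [128, 256, 384, 512, 640, 768, 896, 1024]

groups = [1] + last_kv_cache_lens
for grp, n in enumerate(groups, start=1):
    print(f"grp: {grp}, prefill_max_token_num : {n}")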

Supported platforms

Chips   w8a16   w4a16   DDR(w8)   Flash(w8)   DDR(w4)   Flash(w4)
AX650   -       -       -         -           2.8 GB    2.9 GB
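
As a rough plausibility check on the w4 figures (back-of-the-envelope arithmetic, not vendor data), the DDR footprint is consistent with the model's size. The constants below come from this README's run command and logs; the breakdown itself is an assumption:

# Rough memory estimate for the int4 (w4a16) build.
params_non_embed = 2.77e9                 # Qwen2.5-3B non-embedding params (approx.)
weights_gb = params_non_embed * 0.5 / 1e9 # 4-bit weights: 0.5 byte per parameter
embed_gb = 151936 * 2048 * 2 / 1e9        # bf16 embedding table (--tokens_embed_*)
kv_gb = 36 * 2 * 2047 * 256 * 2 / 1e9     # 36 layers, K+V, kv_cache_num=2047,
                                          # kv_cache_size=256, bf16
print(f"weights ~{weights_gb:.2f} GB, embeddings ~{embed_gb:.2f} GB, kv ~{kv_gb:.2f} GB")
# ~1.39 + ~0.62 + ~0.08 GB ≈ 2.1 GB; runtime buffers plausibly account for
# the remainder of the ~2.8 GB DDR figure.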

How to use

Download all files from this repository to the device:

(base) axera@raspberrypi:~/samples/AXERA-TECH/Qwen2.5-3B-Instruct $ tree -L 1
.
├── config.json
├── main_api_ax650
├── main_api_axcl_aarch64
├── main_api_axcl_x86
├── main_ax650
├── main_axcl_aarch64
├── main_axcl_x86
├── post_config.json
├── qwen2.5-3b-ctx-int4-ax650
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer_uid.py
├── README.md
├── run_qwen2.5_3b_ctx_ax650.sh
├── run_qwen2.5_3b_ctx_axcl_aarch64.sh
├── run_qwen2.5_3b_ctx_axcl_x86.sh
├── run_qwen2.5_3b_ctx_int4_ax650.sh
├── run_qwen2.5_3b_ctx_int4_axcl_aarch64.sh
└── run_qwen2.5_3b_ctx_int4_axcl_x86.sh

3 directories, 16 files

Start the Tokenizer service

(py312) axera@raspberrypi:~/samples/AXERA-TECH/Qwen2.5-3B-Instruct $ python qwen2.5_tokenizer_uid.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
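
For context, a tokenizer service of this kind can be quite small. The sketch below is a hypothetical minimal version built on transformers and Python's http.server; the shipped qwen2.5_tokenizer_uid.py additionally manages per-client uids (visible in the runtime log), so its actual endpoints and payloads may differ:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from transformers import AutoTokenizer

# Load the tokenizer from the local qwen2.5_tokenizer directory.
tokenizer = AutoTokenizer.from_pretrained("qwen2.5_tokenizer")

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical API: accept {"text": "..."} and return {"token_ids": [...]}.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        text = json.loads(body)["text"]
        reply = json.dumps({"token_ids": tokenizer.encode(text)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(reply)

print("Server running at http://0.0.0.0:12345")
HTTPServer(("0.0.0.0", 12345), Handler).serve_forever()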

System prompt cache

  • The system prompt can be preset via the --system_prompt option
  • The system prompt can be cached as a KV cache in a specified folder via --kvcache_path, so it loads quickly on the next run; the first run writes the cache and later runs reload it, skipping the system prompt prefill
  • The folder must be created manually before running, e.g. mkdir kvcache
./main_axcl_aarch64 \
--template_filename_axmodel "qwen2.5-3b-ctx-int4-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 36 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "qwen2.5-3b-ctx-int4-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "qwen2.5-3b-ctx-int4-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 151936 \
--tokens_embed_size 2048 \
--use_mmap_load_embed 1 \
--live_print 1 \
--devices 0

#--system_prompt "Your name is Xiaozhi (allen), and you are a harmless AI assistant. Today (April 1st) in Shenzhen it is overcast; it is April Fools' Day, with temperatures between 14°C and 19°C and a light breeze." \
#--kvcache_path "./kvcache" \

Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or the AX650N DEMO board

Open another terminal and run run_qwen2.5_3b_ctx_ax650.sh

TODO

Inference with M.2 Accelerator card

What is the M.2 Accelerator card? This demo runs on a Raspberry Pi 5.

(base) axera@raspberrypi:~/samples/AXERA-TECH/Qwen2.5-3B-Instruct $ ./run_qwen2.5_3b_ctx_int4_axcl_aarch64.sh
[I][                            Init][ 130]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: ec8f8194-c12f-41fa-b1a0-1ae232c9f15a
bos_id: -1, eos_id: 151645
  2% | █                                 |   1 /  39 [0.55s<21.33s, 1.83 count/s] tokenizer init ok
[I][                            Init][  45]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.55s<10.67s, 3.66 count/s] embed_selector init ok
[I][                             run][  30]: AXCLWorker start with devid 0
100% | ████████████████████████████████ |  39 /  39 [61.29s<64.60s, 0.60 count/s] init post axmodel ok, remain_cmm(3981 MB)
[I][                            Init][ 221]: max_token_len : 2047
[I][                            Init][ 224]: kv_cache_size : 256, kv_cache_num: 2047
[I][                            Init][ 232]: prefill_token_num : 128
[I][                            Init][ 236]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 236]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 236]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 236]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 236]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 236]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 236]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 236]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 236]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 240]: prefill_max_token_num : 1024
________________________
|    ID| remain cmm(MB)|
========================
|     0|           3981|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 263]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][          GenerateKVCachePrefill][ 324]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][          GenerateKVCachePrefill][ 367]: input_num_token:21
[I][                            main][ 234]: precompute_len: 21
[I][                            main][ 235]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> who are you
[I][                      SetKVCache][ 614]: prefill_grpid:2 kv_cache_num:128 precompute_len:21 input_num_token:11
[I][                      SetKVCache][ 617]: current prefill_max_token_num:896
[I][                             Run][ 855]: input token num : 11, prefill_split_num : 1
[I][                             Run][ 887]: input_num_token:11
[I][                             Run][1016]: ttft: 596.11 ms
I am Qwen, created by Alibaba Cloud. I am here to assist with a wide range of tasks and answer a variety of questions. How can I assist you today?

[N][                             Run][1168]: hit eos,avg 7.72 token/s

[I][                      GetKVCache][ 583]: precompute_len:67, remaining:957
prompt >> q
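
The load_config block in the log above mirrors post_config.json, which selects the sampling strategy used at decode time. For intuition, here is a small standalone sketch of top-k sampling with temperature using the same settings (top_k=10, temperature=0.9); it illustrates the technique only and is not the AXEngine runtime's implementation:

import math, random

def sample_top_k(logits, top_k=10, temperature=0.9):
    # Keep the top_k highest logits, apply temperature scaling, then
    # sample from the renormalized softmax over that subset.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                               # for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = random.random() * sum(weights)
    for idx, w in zip(top, weights):
        r -= w
        if r <= 0:
            return idx
    return top[-1]

print(sample_top_k([0.1, 2.0, -1.0, 3.5, 0.7]))   # toy 5-token vocabulary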