Qwen2.5-1.5B-Instruct-CTX-Int8

This is a version of Qwen2.5-1.5B-Instruct converted to run on the Axera NPU using w8a16 quantization.


Compatible with Pulsar2 version: 4.0 (not yet released)

Features

  • Support for longer contexts; in this sample, 2.5k tokens
  • Support for multi-turn (contextual) dialogue
  • Support for caching the system prompt as KV cache

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4

  • Pulsar2 Link: How to Convert LLM from Huggingface to axmodel
  • AXera NPU AXEngine LLM Runtime
  • AXera NPU AXCL LLM Runtime
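
A hedged sketch of what the Pulsar2 export can look like. The llm_build flag set changes between Pulsar2 releases, so treat every flag below as an assumption and follow the linked conversion guide for your version; kv_cache_len and prefill_len here simply mirror the max_token_len (2559) and prefill_token_num (128) reported in the runtime logs further down.

# Assumed Pulsar2 invocation; verify flags against your Pulsar2 release
pulsar2 llm_build \
    --input_path Qwen/Qwen2.5-1.5B-Instruct \
    --output_path qwen2.5-1.5b-ctx-ax650 \
    --hidden_state_type bf16 \
    --kv_cache_len 2559 \
    --prefill_len 128 \
    --chip AX650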

Supported Platforms

Chips    w8a16           w4a16    DDR      Flash
AX650    11 tokens/sec   TBD      2.3 GB   2.3 GB

How to use

Download all files from this repository to the device
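
If the device has network access, one way to fetch everything is the huggingface-cli downloader (a minimal sketch; cloning the repository with git lfs works just as well):

huggingface-cli download AXERA-TECH/Qwen2.5-1.5B-Instruct-CTX-Int8 \
    --local-dir Qwen2.5-1.5B-Instruct-CTX-Int8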

root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# tree -L 1
.
├── kvcache
├── main
├── main_axcl_aarch64
├── main_axcl_x86
├── post_config.json
├── qwen2.5-1.5b-ctx-ax650
├── qwen2.5_tokenizer
├── qwen2.5_tokenizer_uid.py
├── run_qwen2.5_1.5b_ctx_ax650.sh
├── run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
└── run_qwen2.5_1.5b_ctx_axcl_x86.sh
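
Depending on how the files were copied over, the prebuilt binaries may arrive without the executable bit; restoring it before the first run avoids a "permission denied" error (a minimal sketch):

chmod +x main main_axcl_aarch64 main_axcl_x86 run_qwen2.5_1.5b_ctx_*.sh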

Start the Tokenizer service

root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# python qwen2.5_tokenizer_uid.py
Server running at http://0.0.0.0:12345
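
The runtime only needs this service to be reachable at the URL passed via --url_tokenizer_model, so a quick port check from a second terminal is enough to confirm it is up (a minimal sketch; the HTTP API itself is internal to main):

# Confirm the tokenizer service is listening on port 12345
ss -lnt | grep 12345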

System prompt cache

  • The system prompt can be preset through the --system_prompt option
  • The system prompt can be cached as KV cache in the folder given by --kvcache_path, for quick loading on the next run
  • This folder must be created manually before the first run, e.g. mkdir kvcache; see the sketch below this list for preparing and invalidating it
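
A minimal sketch of preparing and, when needed, invalidating the cache. The k_cache_*.bin / v_cache_*.bin file names are taken from the runtime logs below, so treat the wildcard as an assumption:

mkdir -p kvcache        # create the cache folder before the first run
# After changing --system_prompt, the cached KV no longer matches it;
# delete the cache files so the next run regenerates them
rm -f kvcache/*.bin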
(base) axera@raspberrypi:~/samples/qwen2.5-1.5b-ctx $ cat run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
./main_axcl_aarch64 \
--system_prompt "你的名字叫小智(allen),你是一个人畜无害的AI助手。深圳市今天(4月1日)阴天,愚人节,气温在14°C至19°C之间,微风。" \
--kvcache_path "./kvcache" \
--template_filename_axmodel "qwen2.5-1.5b-ctx-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 28 \
--tokenizer_type 2 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "qwen2.5-1.5b-ctx-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "qwen2.5-1.5b-ctx-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 151936 \
--tokens_embed_size 1536 \
--use_mmap_load_embed 1 \
--live_print 1 \
--devices 0

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N demo board

Open another terminal and run run_qwen2.5_1.5b_ctx_ax650.sh

root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# mkdir -p kvcache
root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# ./run_qwen2.5_1.5b_ctx_ax650.sh
[I][                            Init][ 107]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
bos_id: -1, eos_id: 151645
  3% | ██                                |   1 /  31 [0.21s<6.39s, 4.85 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [5.04s<5.04s, 6.15 count/s] init post axmodel ok,remain_cmm(9656 MB)
[I][                            Init][ 185]: max_token_len : 2559
[I][                            Init][ 190]: kv_cache_size : 256, kv_cache_num: 2559
[I][                            Init][ 198]: prefill_token_num : 128
[I][                            Init][ 202]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 202]: grp: 2, prefill_max_token_num : 512
[I][                            Init][ 202]: grp: 3, prefill_max_token_num : 1024
[I][                            Init][ 202]: grp: 4, prefill_max_token_num : 1536
[I][                            Init][ 202]: grp: 5, prefill_max_token_num : 2048
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 213]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[E][                    load_kvcache][ 101]: k_cache ./kvcache/k_cache_0.bin or v_cache ./kvcache/v_cache_0.bin not exist
[W][                            main][ 217]: load kvcache from path: ./kvcache failed,generate kvcache
100% | ████████████████████████████████ |  53 /  53 [4.12s<4.12s, 12.85 token/s]
[I][                      GetKVCache][ 325]: precompute_len:53
[I][                            main][ 224]: generate kvcache to path: ./kvcache
[I][                            main][ 226]: precompute_len: 53
[I][                            main][ 227]: system_prompt: 你的名字叫小智(allen),你是一个人畜无害的AI助手。深圳市今天(4月1日)阴天,愚人节,气温在14°C至19°C之间,微风。
prompt >> who are you?
[I][                      SetKVCache][ 354]: prefill_grpid:2 kv_cache_num:512 precompute_len:53 input_num_token:12
[I][                             Run][ 527]: input_embed_num(12)
[I][                             Run][ 642]: ttft: 537.06 ms
我是Allen,一个能够回答问题、提供信息和执行任务的虚拟助手。我可以帮助你解决各种问题、做计划、玩游戏、甚至是进行一些娱乐活动。请问有什么我能帮助你的吗?

[N][                             Run][ 756]: hit eos,avg 11.09 token/s

[I][                      GetKVCache][ 325]: precompute_len:108
prompt >> 今天是几号,天气怎么样
[I][                      SetKVCache][ 354]: prefill_grpid:2 kv_cache_num:512 precompute_len:108 input_num_token:15
[I][                             Run][ 527]: input_embed_num(15)
[I][                             Run][ 642]: ttft: 536.81 ms
今天是4月1日,愚人节。根据您所描述的深圳天气情况,气温在14°C至19°C之间,气温较低,建议穿着适当。希望您今天愉快!

[N][                             Run][ 756]: hit eos,avg 11.17 token/s

[I][                      GetKVCache][ 325]: precompute_len:166
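
The sampling block echoed during init comes from post_config.json in this repository. A hedged sketch of adjusting it with jq, assuming (as the load_config log line suggests) the runtime re-reads the file on the next start:

# Enable nucleus sampling and soften the temperature, then rerun the script
jq '.enable_top_p_sampling = true | .temperature = 0.7' post_config.json \
    > post_config.tmp && mv post_config.tmp post_config.json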

Inference with M.2 Accelerator card

What is M.2 Accelerator card? This demo runs on a Raspberry Pi 5.

(base) axera@raspberrypi:~/samples/Qwen2.5-1.5B-Instruct-CTX-Int8 $ mkdir -p kvcache
(base) axera@raspberrypi:~/samples/Qwen2.5-1.5B-Instruct-CTX-Int8 $ ./run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
[I][                            Init][ 134]: LLM init start
[I][                            Init][  41]: connect http://127.0.0.1:12345 ok
bos_id: -1, eos_id: 151645
  3% | ██                                |   1 /  31 [0.46s<14.11s, 2.20 count/s] tokenizer init ok
[I][                            Init][  45]: LLaMaEmbedSelector use mmap
  6% | ███                               |   2 /  31 [0.46s<7.05s, 4.40 count/s] embed_selector init ok
[I][                             run][  30]: AXCLWorker start with devid 0
100% | ████████████████████████████████ |  31 /  31 [29.18s<29.18s, 1.06 count/s] init post axmodel ok,remain_cmm(-1 MB)
[I][                            Init][ 235]: max_token_len : 2559
[I][                            Init][ 238]: kv_cache_size : 256, kv_cache_num: 2559
[I][                            Init][ 246]: prefill_token_num : 128
[I][                            Init][ 250]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 250]: grp: 2, prefill_max_token_num : 512
[I][                            Init][ 250]: grp: 3, prefill_max_token_num : 1024
[I][                            Init][ 250]: grp: 4, prefill_max_token_num : 1536
[I][                            Init][ 250]: grp: 5, prefill_max_token_num : 2048
________________________
|    ID| remain cmm(MB)|
========================
|     0|             -1|
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 275]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[E][                    load_kvcache][ 100]: k_cache ./kvcache/k_cache_0.bin or v_cache ./kvcache/v_cache_0.bin not exist
[W][                            main][ 223]: load kvcache from path: ./kvcache failed,generate kvcache
100% | ████████████████████████████████ |  53 /  53 [5.06s<5.06s, 10.47 token/s]
[I][                      GetKVCache][ 419]: precompute_len:53
[I][                            main][ 230]: generate kvcache to path: ./kvcache
[I][                            main][ 232]: precompute_len: 53
[I][                            main][ 233]: system_prompt: 你的名字叫小智(allen),你是一个人畜无害的AI助手。深圳市今天(4月1日)阴天,愚人节,气温在14°C至19°C之间,微风。
prompt >> 你是谁
[I][                      SetKVCache][ 448]: prefill_grpid:2 kv_cache_num:512 precompute_len:53 input_num_token:10
[I][                             Run][ 722]: input token num : 10
[I][                             Run][ 823]: ttft: 548.23 ms
我是深圳市气象局发布的天气预报,我叫小智,是为了解答大家关于天气的问题而设计的。如果你对天气有疑问,欢迎随时询问!

[N][                             Run][ 975]: hit eos,avg 9.04 token/s

[I][                      GetKVCache][ 419]: precompute_len:98
prompt >> 你能干什么
[I][                      SetKVCache][ 448]: prefill_grpid:2 kv_cache_num:512 precompute_len:98 input_num_token:10
[I][                             Run][ 722]: input token num : 10
[I][                             Run][ 823]: ttft: 548.07 ms
我能回答关于天气、生活、科技、文化、娱乐、历史等方面的很多问题。如果你有任何想知道的内容,都可以问我哦!

[N][                             Run][ 975]: hit eos,avg 9.03 token/s

[I][                      GetKVCache][ 419]: precompute_len:135
prompt >> q
[I][                             run][  80]: AXCLWorker exit with devid 0


>> q

(base) axera@raspberrypi:~ $ axcl-smi
+------------------------------------------------------------------------------------------------+
| AXCL-SMI  V2.25.0_20250117163029                                Driver  V2.25.0_20250117163029 |
+-----------------------------------------+--------------+---------------------------------------+
| Card  Name                     Firmware | Bus-Id       |                          Memory-Usage |
| Fan   Temp                Pwr:Usage/Cap | CPU      NPU |                             CMM-Usage |
|=========================================+==============+=======================================|
|    0  AX650N                    V2.25.0 | 0000:01:00.0 |                188 MiB /      945 MiB |
|   --   37C                      -- / -- | 1%        0% |               2335 MiB /     7040 MiB |
+-----------------------------------------+--------------+---------------------------------------+

+------------------------------------------------------------------------------------------------+
| Processes:                                                                                     |
| Card      PID  Process Name                                                   NPU Memory Usage |
|================================================================================================|
|    0   147835  /home/axera/samples/qwen2.5-1.5b-ctx/main_axcl_aarch64               1990172 KiB |
+------------------------------------------------------------------------------------------------+
(base) axera@raspberrypi:~ $