Qwen3-Embedding-0.6B
This version of Qwen3-Embedding-0.6B has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 4.1
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
Pulsar2 documentation: How to Convert an LLM from Hugging Face to axmodel
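As a rough sketch, Pulsar2's `llm_build` subcommand is the usual entry point for this kind of conversion. The flag values below (paths, cache length, prefill length) are illustrative assumptions, not the exact settings used to build this repo; check the Pulsar2 documentation for your version:

```bash
# Illustrative only: flag names follow the Pulsar2 LLM build docs, but the
# paths and lengths here are placeholders, not this repo's actual settings.
pulsar2 llm_build \
  --input_path Qwen/Qwen3-Embedding-0.6B \
  --output_path qwen3_embedding_0.6b_axmodel \
  --kv_cache_len 1023 \
  --hidden_state_type bf16 \
  --prefill_len 128 \
  --chip AX650
```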
Support Platform
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Per-subgraph latency:

| Subgraph | Latency |
|---|---|
| g1 | 5.561 ms |
| g2 | 9.140 ms |
| g3 | 12.757 ms |
| g4 | 16.446 ms |
| g5 | 21.392 ms |
| g6 | 23.712 ms |
| g7 | 27.174 ms |
| g8 | 30.897 ms |
| g9 | 34.829 ms |
- Shortest forward pass: 5.561 ms
- Longest forward pass: 181.908 ms (the sum of the nine subgraph latencies; see the check below)
- LayerNum: 28
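The longest forward time is exactly the sum of the nine subgraph latencies above, which is easy to verify:

```python
# Sum of the per-subgraph latencies listed above (values copied from this README).
subgraph_ms = [5.561, 9.140, 12.757, 16.446, 21.392, 23.712, 27.174, 30.897, 34.829]
print(f"{sum(subgraph_ms):.3f} ms")  # 181.908 ms, matching the longest forward pass
```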
| Chips | TTFT | Throughput (w8a16) |
|---|---|---|
| AX650 | 155.708 ms (128 tokens, shortest) | 0.82 tokens/ms |
| AX650 | 5093.42 ms (1024 tokens, longest) | 0.20 tokens/ms |
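The throughput column is simply the prompt length divided by the TTFT, which is why the unit works out to tokens per millisecond:

```python
# Reproduce the throughput figures in the table above from the TTFT numbers.
for tokens, ttft_ms in [(128, 155.708), (1024, 5093.42)]:
    print(f"{tokens} tokens / {ttft_ms} ms = {tokens / ttft_ms:.2f} tokens/ms")
# 128 tokens / 155.708 ms = 0.82 tokens/ms
# 1024 tokens / 5093.42 ms = 0.20 tokens/ms
```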
How to use
Download all files from this repository to the device.
If you are using an AX650 board:
```
root@ax650 ~/yongqiang/push_hugging_face/Qwen3-Embedding-0.6B # tree -L 1
.
├── config.json
├── infer_axmodel.py
├── qwen3_embedding_0.6b_axmodel
├── qwen3_embedding_0.6b_tokenizer
├── README.md
└── utils

3 directories, 3 files
```
Install transformers

```
# Requires transformers>=4.51.0
pip install transformers==4.51.0
```
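To confirm the installed version meets the requirement:

```python
# Verify the installed transformers version (must be >= 4.51.0).
import transformers
print(transformers.__version__)
```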
Run inference on an AX650 host, such as the M4N-Dock(爱芯派Pro) or the AX650N DEMO Board:
```
root@ax650 ~/yongqiang/push_hugging_face/Qwen3-Embedding-0.6B # python3 infer_axmodel.py
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
[[0.7555467486381531, 0.1756950318813324], [0.4137178063392639, 0.4459586441516876]]
```
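The final 2x2 matrix holds pairwise similarity scores between two example queries and two example documents. Below is a minimal sketch of how such scores are typically computed from embedding vectors; it assumes L2-normalized embeddings and a 1024-dimensional output (with random placeholder values), and is not necessarily how infer_axmodel.py implements it internally:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: 2 query vectors and 2 document vectors.
# The 1024-dim size is an assumption for illustration.
query_emb = torch.randn(2, 1024)
doc_emb = torch.randn(2, 1024)

# L2-normalize so that the dot product equals cosine similarity.
query_emb = F.normalize(query_emb, p=2, dim=1)
doc_emb = F.normalize(doc_emb, p=2, dim=1)

# 2x2 matrix of query-document similarity scores, like the output above.
scores = query_emb @ doc_emb.T
print(scores.tolist())
```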