# Qwen3-Embedding-0.6B

This version of Qwen3-Embedding-0.6B has been converted to run on the Axera NPU using w8a16 quantization.
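Here, w8a16 means the weights are quantized to INT8 while activations are kept in 16-bit floating point. The NumPy sketch below illustrates the idea only; it is a hypothetical helper for intuition, not Pulsar2's actual quantizer:

```python
import numpy as np

def quantize_w8(w: np.ndarray):
    # Per-output-channel symmetric INT8 quantization of a weight matrix.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def linear_w8a16(x: np.ndarray, q: np.ndarray, scale: np.ndarray):
    # Activations stay fp16; INT8 weights are dequantized before the matmul.
    w = q.astype(np.float16) * scale
    return (x @ w.T).astype(np.float16)

w = np.random.randn(64, 32).astype(np.float32)   # [out_features, in_features]
q, s = quantize_w8(w)
y = linear_w8a16(np.random.randn(1, 32).astype(np.float16), q, s)
print(y.shape)  # (1, 64)
```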


Compatible with Pulsar2 version: 4.1

## Convert tools links

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel (a representative build invocation is sketched below)
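For orientation, a Pulsar2 LLM build is typically driven from the command line roughly as follows. All paths and values here are placeholders, and the exact subcommand and flags depend on your Pulsar2 release, so check the documentation linked above:

```bash
pulsar2 llm_build \
    --input_path Qwen/Qwen3-Embedding-0.6B \
    --output_path ./qwen3_embedding_0.6b_axmodel \
    --chip AX650
```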

## Support Platform

- AX650: M4N-Dock(爱芯派Pro), AX650N DEMO Board

Latency of each subgraph:

- g1: 5.561 ms
- g2: 9.140 ms
- g3: 12.757 ms
- g4: 16.446 ms
- g5: 21.392 ms
- g6: 23.712 ms
- g7: 27.174 ms
- g8: 30.897 ms
- g9: 34.829 ms

- Shortest forward pass (a single subgraph): 5.561 ms
- Longest forward pass (all nine subgraphs): 181.908 ms
- Number of layers: 28
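As a quick sanity check, the per-subgraph latencies sum exactly to the longest forward-pass time reported above:

```python
subgraph_ms = [5.561, 9.140, 12.757, 16.446, 21.392, 23.712, 27.174, 30.897, 34.829]
print(round(sum(subgraph_ms), 3))  # 181.908 -> all nine subgraphs run back to back
```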

| Chip | TTFT (w8a16) | Prefill throughput |
|---|---|---|
| AX650 | 155.708 ms (128 tokens, shortest) | 0.82 tokens/ms |
| AX650 | 5093.42 ms (1024 tokens, longest) | 0.20 tokens/ms |
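The throughput column is simply the prompt length divided by the TTFT; note that with TTFT given in milliseconds, the resulting unit is tokens per millisecond. A quick check against the table's numbers:

```python
for tokens, ttft_ms in [(128, 155.708), (1024, 5093.42)]:
    print(f"{tokens} tokens / {ttft_ms} ms = {tokens / ttft_ms:.2f} tokens/ms")
# 128 tokens / 155.708 ms = 0.82 tokens/ms
# 1024 tokens / 5093.42 ms = 0.20 tokens/ms
```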

## How to use

Download all files from this repository to the device.

If you are using an AX650 board:

```
root@ax650 ~/yongqiang/push_hugging_face/Qwen3-Embedding-0.6B # tree -L 1
.
├── config.json
├── infer_axmodel.py
├── qwen3_embedding_0.6b_axmodel
├── qwen3_embedding_0.6b_tokenizer
├── README.md
└── utils

3 directories, 3 files
```

Install transformers

```bash
# Requires transformers>=4.51.0
pip install transformers==4.51.0
```
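You can confirm the installed version before running inference:

```bash
python3 -c "import transformers; print(transformers.__version__)"   # expect 4.51.0
```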

Run inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO board:

```
root@ax650 ~/yongqiang/push_hugging_face/Qwen3-Embedding-0.6B # python3 infer_axmodel.py
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
[[0.7555467486381531, 0.1756950318813324], [0.4137178063392639, 0.4459586441516876]]
```
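The final line is a 2×2 query-document similarity matrix. To reproduce the same kind of scores off-device with the original Hugging Face checkpoint, a minimal CPU-side sketch looks like the following. The example texts are placeholders (infer_axmodel.py's actual inputs may differ), and Qwen3-Embedding usually prepends a task instruction to queries, which is omitted here:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def last_token_pool(hidden, attention_mask):
    # Qwen3-Embedding pools the hidden state of the last non-padding token.
    seq_lens = attention_mask.sum(dim=1) - 1
    return hidden[torch.arange(hidden.size(0)), seq_lens]

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B", padding_side="right")
model = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B").eval()

queries = ["What is the capital of China?", "Explain gravity"]
docs = ["The capital of China is Beijing.",
        "Gravity is a force that attracts two bodies towards each other."]

batch = tokenizer(queries + docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state
emb = F.normalize(last_token_pool(hidden, batch["attention_mask"]), p=2, dim=1)
print((emb[:2] @ emb[2:].T).tolist())  # 2x2 query-document similarity matrix
```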