Qwen3-Embedding-0.6B
This version of Qwen3-Embedding-0.6B has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 4.1
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
Pulsar2 documentation: How to Convert an LLM from Hugging Face to axmodel
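As a rough sketch, Pulsar2's `llm_build` subcommand is the usual entry point for this kind of conversion. The flag values below (paths, cache length, prefill length) are illustrative assumptions, not the exact settings used to build this repo; check the Pulsar2 documentation for your version:

```bash
# Illustrative only: flag names follow the Pulsar2 LLM build docs, but the
# paths and lengths here are placeholders, not this repo's actual settings.
pulsar2 llm_build \
  --input_path Qwen/Qwen3-Embedding-0.6B \
  --output_path qwen3_embedding_0.6b_axmodel \
  --kv_cache_len 1023 \
  --hidden_state_type bf16 \
  --prefill_len 128 \
  --chip AX650
```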
Support Platform
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Per-subgraph latency:

| Subgraph | Latency |
|---|---|
| g1 | 5.561 ms |
| g2 | 9.140 ms |
| g3 | 12.757 ms |
| g4 | 16.446 ms |
| g5 | 21.392 ms |
| g6 | 23.712 ms |
| g7 | 27.174 ms |
| g8 | 30.897 ms |
| g9 | 34.829 ms |
- Shortest forward pass: 5.561 ms
- Longest forward pass: 181.908 ms (the sum of the nine subgraph latencies; see the check below)
- LayerNum: 28
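The longest forward time is exactly the sum of the nine subgraph latencies above, which is easy to verify:

```python
# Sum of the per-subgraph latencies listed above (values copied from this README).
subgraph_ms = [5.561, 9.140, 12.757, 16.446, 21.392, 23.712, 27.174, 30.897, 34.829]
print(f"{sum(subgraph_ms):.3f} ms")  # 181.908 ms, matching the longest forward pass
```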
| Chips | TTFT | Throughput (w8a16) |
|---|---|---|
| AX650 | 155.708 ms (128 tokens, shortest) | 0.82 tokens/ms |
| AX650 | 5093.42 ms (1024 tokens, longest) | 0.20 tokens/ms |
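The throughput column is simply the prompt length divided by the TTFT, which is why the unit works out to tokens per millisecond:

```python
# Reproduce the throughput figures in the table above from the TTFT numbers.
for tokens, ttft_ms in [(128, 155.708), (1024, 5093.42)]:
    print(f"{tokens} tokens / {ttft_ms} ms = {tokens / ttft_ms:.2f} tokens/ms")
# 128 tokens / 155.708 ms = 0.82 tokens/ms
# 1024 tokens / 5093.42 ms = 0.20 tokens/ms
```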
How to use
Download all files from this repository to the device.
If you are using an AX650 board:
```
root@ax650 ~/yongqiang/push_hugging_face/Qwen3-Embedding-0.6B # tree -L 1
.
├── config.json
├── infer_axmodel.py
├── qwen3_embedding_0.6b_axmodel
├── qwen3_embedding_0.6b_tokenizer
├── README.md
└── utils

3 directories, 3 files
```
Install transformers

```
# Requires transformers>=4.51.0
pip install transformers==4.51.0
```
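To confirm the installed version meets the requirement:

```python
# Verify the installed transformers version (must be >= 4.51.0).
import transformers
print(transformers.__version__)
```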
Run inference on an AX650 host, such as the M4N-Dock(爱芯派Pro) or the AX650N DEMO Board:
```
root@ax650 ~/yongqiang/push_hugging_face/Qwen3-Embedding-0.6B # python3 infer_axmodel.py
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
slice_indices: [0]
Slice prefill done: 0
[[0.7555467486381531, 0.1756950318813324], [0.4137178063392639, 0.4459586441516876]]
```
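The final 2x2 matrix holds pairwise similarity scores between two example queries and two example documents. Below is a minimal sketch of how such scores are typically computed from embedding vectors; it assumes L2-normalized embeddings and a 1024-dimensional output (with random placeholder values), and is not necessarily how infer_axmodel.py implements it internally:

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: 2 query vectors and 2 document vectors.
# The 1024-dim size is an assumption for illustration.
query_emb = torch.randn(2, 1024)
doc_emb = torch.randn(2, 1024)

# L2-normalize so that the dot product equals cosine similarity.
query_emb = F.normalize(query_emb, p=2, dim=1)
doc_emb = F.normalize(doc_emb, p=2, dim=1)

# 2x2 matrix of query-document similarity scores, like the output above.
scores = query_emb @ doc_emb.T
print(scores.tolist())
```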