NPU performance

#1
by Echo9Zulu - opened

Hi!

A few questions:

Did you use openvino-genai or optimum-intel?

How much context were you able to run?

Can you share throughput results?

Thanks for the upload! Check out my repo for even more OpenVINO quants.

It should work with both openvino-genai and optimum-intel, but make sure the NPU driver is updated and that you're using openvino 2025.2.0. We only tested with openvino-genai.

pip install openvino==2025.2.0
pip install optimum[openvino]
pip install openvino-genai

NPU driver version required: 32.0.100.4023
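If it helps, here's a quick sanity check that the driver and runtime actually see the NPU (a minimal sketch using the standard OpenVINO Python API):

import openvino as ov

core = ov.Core()
print(core.available_devices)  # should list "NPU" once the driver is installed

# If "NPU" shows up, you can also check what the driver reports
if "NPU" in core.available_devices:
    print(core.get_property("NPU", "FULL_DEVICE_NAME"))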

Thanks for sharing all your models! Have you tested Gemma on NPU? We've been meaning to convert them as well but haven't gotten around to it yet.
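If anyone wants to try in the meantime, here's a rough conversion sketch with optimum-intel. The checkpoint name and INT4 settings are assumptions; we haven't verified this on NPU:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Export a Gemma checkpoint to OpenVINO IR with INT4 weight compression
model = OVModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",  # placeholder checkpoint, pick whichever Gemma you want
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("gemma-2-2b-int4-ov")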

Context-wise, we tested up to 8k, but the results weren't great. We ended up going with the https://huggingface.co/bweng/qwen3-8b-int4-ov-npu model instead.

You probably know this, but in openvino-genai you can increase the input context with MAX_PROMPT_LEN:

import openvino_genai as ov_genai

model_path = "path/to/ov-model"  # local folder with the converted model
device = "NPU"
pipe = ov_genai.LLMPipeline(model_path, device, MAX_PROMPT_LEN=4096, CACHE_DIR="./cache")
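Then generating is just (the prompt and token budget here are arbitrary):

result = pipe.generate("What is an NPU?", max_new_tokens=256)
print(result)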

https://huggingface.co/bweng/phi-4-mini-instruct-int4-ov-npu
