NPU performance

#1
by Echo9Zulu - opened

Hi!

A few questions:

Did you use openvino-genai or optimum-intel?

How much context were you able to run?

Can you share throughput results?

Thanks for the upload! Check out my repo for even more OpenVINO quants.

It should work with both openvino-genai and optimum-intel, but make sure the NPU driver is updated and that you're using openvino 2025.2.0. We only tested with openvino-genai.

pip install openvino==2025.2.0
pip install optimum[openvino]
pip install openvino-genai

NPU driver version required: 32.0.100.4023
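If it helps, here's a quick sanity check that the driver and runtime actually see the NPU (a minimal sketch using the standard OpenVINO Python API):

import openvino as ov

core = ov.Core()
print(core.available_devices)  # should list "NPU" once the driver is installed

# If "NPU" shows up, you can also check what the driver reports
if "NPU" in core.available_devices:
    print(core.get_property("NPU", "FULL_DEVICE_NAME"))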

Thanks for sharing all your models! Have you tested Gemma on NPU? We've been meaning to convert them as well but haven't gotten around to it yet.
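If anyone wants to try in the meantime, here's a rough conversion sketch with optimum-intel. The checkpoint name and INT4 settings are assumptions; we haven't verified this on NPU:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Export a Gemma checkpoint to OpenVINO IR with INT4 weight compression
model = OVModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",  # placeholder checkpoint, pick whichever Gemma you want
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained("gemma-2-2b-int4-ov")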

Context-wise, we tested up to 8k, but the results weren't great. We ended up going with the https://huggingface.co/bweng/qwen3-8b-int4-ov-npu model instead.

You probably know this, but in openvino-genai you can increase the input context with MAX_PROMPT_LEN:

import openvino_genai as ov_genai

model_path = "path/to/ov-model"  # local folder with the converted model
device = "NPU"
pipe = ov_genai.LLMPipeline(model_path, device, MAX_PROMPT_LEN=4096, CACHE_DIR="./cache")
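Then generating is just (the prompt and token budget here are arbitrary):

result = pipe.generate("What is an NPU?", max_new_tokens=256)
print(result)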

https://huggingface.co/bweng/phi-4-mini-instruct-int4-ov-npu
