Phi-3-vision-128k-instruct ONNX models for CPU and CUDA

This repository hosts the optimized versions of microsoft/Phi-3-vision-128k-instruct to accelerate inference with ONNX Runtime. This repository is a clone from microsoft/Phi-3-vision-128k-instruct-onnx-cpu, with extra files necessary for deploying the model with OpenAI-API-Compatible endpoints through embeddedllm pypi library.

Usage on Windows (Intel / AMD / Nvidia / Qualcomm)

conda create -n onnx python=3.10
conda activate onnx
winget install -e --id GitHub.GitLFS
pip install huggingface-hub[cli]
huggingface-cli download EmbeddedLLM/Phi-3-vision-128k-instruct-onnx --include='onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4' --local-dir .\Phi-3-vision-128k-instruct-onnx
pip install numpy==1.26.4
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py" -OutFile "phi3v.py"
pip install onnxruntime
pip install --pre onnxruntime-genai==0.3.0rc2
python phi3v.py -m .\Phi-3-vision-128k-instruct-onnx

UPSTREAM README.md

Phi-3-vision-128k-instruct ONNX

This repository hosts the optimized versions of microsoft/Phi-3-vision-128k-instruct to accelerate inference with DirectML and ONNX Runtime.

The Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets which include - synthetic data and filtered publicly available websites - with a focus on very high-quality, reasoning dense data both on text and vision.
The model belongs to the Phi-3 model family, and the multimodal version comes with 128K context length (in tokens) it can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Intended Uses

Primary use cases

The model is intended for broad commercial and research use in English. The model provides uses for general purpose AI systems and applications with visual and text input capabilities which require

memory/compute constrained environments;
latency bound scenarios;
general image understanding;
OCR;
chart and table understanding.

Our model is designed to accelerate research on efficient language and multimodal models, for use as a building block for generative AI powered features.

Use case considerations

Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

ONNX Models

Here are some of the optimized configurations we have added:

ONNX model for int4 DirectML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using AWQ.
ONNX model for int4 CPU and Mobile: ONNX model for CPU and mobile using int4 quantization via RTN. There are two versions uploaded to balance latency vs. accuracy. Acc=1 is targeted at improved accuracy, while Acc=4 is for improved performance. For mobile devices, we recommend using the model with acc-level-4.

Usage

Installation and Setup

To use the Phi-3-vision-128k-instruct ONNX model on Windows with DirectML, follow these steps:

Create and activate a Conda environment:

conda create -n onnx python=3.10
conda activate onnx

Install Git LFS:

winget install -e --id GitHub.GitLFS

Install Hugging Face CLI:

pip install huggingface-hub[cli]

Download the model:

huggingface-cli download EmbeddedLLM/Phi-3-vision-128k-instruct-onnx --include="onnx/cpu_and_mobile/*" --local-dir .\Phi-3-vision-128k-instruct

Install necessary Python packages:

pip install numpy==1.26.4
pip install onnxruntime
pip install --pre onnxruntime-genai==0.3.0rc2

Install Visual Studio 2015 runtime:

conda install conda-forge::vs2015_runtime

Download the example script:

Invoke-WebRequest -Uri "https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3-qa.py" -OutFile "phi3-qa.py"

Run the example script:

python phi3-qa.py -m .\Phi-3-vision-128k-instruct

Hardware Requirements

Minimum Configuration:

Windows: DirectX 12-capable GPU (AMD/Nvidia/Intel)
CPU: x86_64 / ARM64

Tested Configurations:

GPU: AMD Ryzen 8000 Series iGPU (DirectML)
CPU: AMD Ryzen CPU

Hardware Supported

The model has been tested on:

GPU SKU: RTX 4090 (DirectML)

Minimum Configuration Required:

Windows: DirectX 12-capable GPU and a minimum of 10GB of combined RAM

Model Description

Developed by: Microsoft
Model type: ONNX
Language(s) (NLP): Python, C, C++
License: MIT
Model Description: This is a conversion of the Phi-3 Vision 128K Instruct model for ONNX Runtime inference.

Additional Details

License

The model is licensed under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party’s policies.