# Phi-3.5 Mini Instruct - Quantized ONNX Model (Consolidated)
## Model Overview
This is Microsoft's Phi-3.5-mini-instruct model, quantized to INT8 and optimized for Qualcomm Snapdragon NPUs. This version consolidates all files into a single directory to simplify deployment.
## Model Specifications
- Base Model: microsoft/Phi-3.5-mini-instruct
- Size: 7292.4 MB (quantized from 7.3GB original)
- Compression: 50% size reduction
- Format: ONNX INT8 quantized with external data
- Files: 203 files total
- Target: Qualcomm Snapdragon NPUs
## Quick Start
### Installation
```bash
pip install onnxruntime transformers numpy
```
### Basic Usage
```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Prepare input
text = "Hello, what is artificial intelligence?"
inputs = tokenizer(text, return_tensors="np", max_length=64, truncation=True, padding="max_length")

# Run inference
outputs = session.run(None, {"input_ids": inputs["input_ids"]})
logits = outputs[0]

print(f"Input: {text}")
print(f"Output shape: {logits.shape}")
```
### Text Generation Example
```python
def generate_response(prompt, max_new_tokens=50):
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="np", max_length=64, truncation=True)
    input_ids = inputs["input_ids"]

    generated_tokens = []
    for _ in range(max_new_tokens):
        # Get model prediction
        outputs = session.run(None, {"input_ids": input_ids})
        logits = outputs[0]

        # Get next token (greedy)
        next_token_id = np.argmax(logits[0, -1, :])
        generated_tokens.append(next_token_id)

        # Stop on EOS
        if next_token_id == tokenizer.eos_token_id:
            break

        # Add to input for next iteration
        input_ids = np.concatenate([input_ids, [[next_token_id]]], axis=1)

    # Decode response
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response

# Example
response = generate_response("What is machine learning?")
print(f"Response: {response}")
```
## Testing Script
```python
#!/usr/bin/env python3
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

def test_model():
    print("Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
    session = ort.InferenceSession("model.onnx")

    test_cases = [
        "Hello, how are you?",
        "What is the capital of France?",
        "Explain artificial intelligence in simple terms."
    ]

    for i, text in enumerate(test_cases, 1):
        print(f"\n{i}. Input: {text}")
        inputs = tokenizer(text, return_tensors="np", max_length=64,
                           truncation=True, padding="max_length")
        outputs = session.run(None, {"input_ids": inputs["input_ids"]})
        print(f"   Output shape: {outputs[0].shape}")

    print("\nAll tests passed!")

if __name__ == "__main__":
    test_model()
```
## Performance Expectations
- Inference Speed: 2-3x faster than CPU on Snapdragon NPUs
- Memory Usage: ~4GB RAM required
- Tokens/Second: 8-15 on Snapdragon 8cx Gen 2
- Latency: <100ms for short sequences
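Throughput depends heavily on hardware, providers, and sequence length, so it is worth measuring on your own device. A minimal timing sketch, reusing the `session` and `tokenizer` from Quick Start (the prompt and run count are arbitrary):

```python
import time

prompt = "Explain artificial intelligence in simple terms."
inputs = tokenizer(prompt, return_tensors="np", max_length=64,
                   truncation=True, padding="max_length")
feed = {"input_ids": inputs["input_ids"]}

session.run(None, feed)                       # warm-up run
runs = 5
start = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
avg_ms = (time.perf_counter() - start) / runs * 1000
print(f"Average forward pass: {avg_ms:.1f} ms")
```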
## File Structure
```
model.onnx              # Main ONNX model file
tokenizer.json          # Tokenizer vocabulary
tokenizer_config.json   # Tokenizer configuration
config.json             # Model configuration
onnx__MatMul_*          # External weight data files (129 files)
*.weight                # Additional model weights
```
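Because the weights live in external data files, a missing file only shows up as an error when the session is created. A small pre-flight check, assuming the file counts listed above:

```python
from pathlib import Path

model_dir = Path(".")
assert (model_dir / "model.onnx").exists(), "model.onnx is missing"
matmul_files = list(model_dir.glob("onnx__MatMul_*"))
print(f"Found {len(matmul_files)} external MatMul data files (expected 129)")
```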
## Important Notes
**All Files Required:** Keep all files in the same directory. The `model.onnx` file references external data files.

**Memory Requirements:** Ensure you have at least 4GB of available RAM.

**Qualcomm NPU Setup:** For optimal performance on Qualcomm hardware:
```python
# Use QNN execution provider (when available)
providers = ['QNNExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)
```
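You can confirm whether the QNN provider is actually registered in your ONNX Runtime build before relying on it:

```python
import onnxruntime as ort

# QNNExecutionProvider only appears here if your onnxruntime build includes QNN support;
# otherwise the session above silently falls back to the CPU provider.
print(ort.get_available_providers())
```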
## Deployment on Qualcomm Devices
### Windows on ARM
- Copy all files to your device
- Install ONNX Runtime: `pip install onnxruntime`
- Run the test script to verify
### Android (with QNN SDK)
- Use ONNX Runtime Mobile with QNN support
- Package all files in your app bundle
- Initialize with QNN execution provider
## Troubleshooting
**Model fails to load:**
- Ensure all files are in the same directory
- Check that you have sufficient RAM (4GB+)
**Slow inference:**
- Try enabling graph optimizations:

```python
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)
```
**Out of memory:**
- Reduce sequence length, e.g. `max_length=32`
- Process smaller batches (a sketch combining both suggestions is shown below)
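A minimal sketch of both suggestions, reusing the Quick Start `session` and `tokenizer` (the prompt list is hypothetical):

```python
prompts = ["First prompt", "Second prompt"]   # hypothetical inputs

for prompt in prompts:                        # batch size of 1
    inputs = tokenizer(prompt, return_tensors="np", max_length=32,
                       truncation=True, padding="max_length")
    outputs = session.run(None, {"input_ids": inputs["input_ids"]})
    print(outputs[0].shape)
```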
## License
This model inherits the license from microsoft/Phi-3.5-mini-instruct.
Quantized and optimized for Qualcomm Snapdragon NPU deployment