# Phi-3.5 Mini Instruct - Quantized ONNX Model (Consolidated)
## Model Overview
This is Microsoft's Phi-3.5-mini-instruct model, quantized to INT8 and optimized for Qualcomm Snapdragon NPUs. This version consolidates all files into a single directory for easier deployment.
## Model Specifications
- **Base Model**: microsoft/Phi-3.5-mini-instruct
- **Size**: 7,292.4 MB (quantized from 7.3 GB original)
- **Compression**: 50% size reduction
- **Format**: ONNX INT8 quantized with external data
- **Files**: 203 files total
- **Target**: Qualcomm Snapdragon NPUs
## Quick Start
### Installation
```bash
pip install onnxruntime transformers numpy
```
### Basic Usage
```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
# Load ONNX model
session = ort.InferenceSession("model.onnx")
# Prepare input
text = "Hello, what is artificial intelligence?"
inputs = tokenizer(text, return_tensors="np", max_length=64, truncation=True, padding="max_length")
# Run inference
outputs = session.run(None, {"input_ids": inputs["input_ids"]})
logits = outputs[0]
print(f"Input: {text}")
print(f"Output shape: {logits.shape}")
```
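Depending on how the graph was exported, it may also expect additional inputs such as `attention_mask` or `position_ids`. A quick, generic way to check is to inspect the session's input and output signature before building the feed dictionary (this reuses the `session` created above):
```python
# List the exact input/output names, shapes, and dtypes the exported graph expects
for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```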
### Text Generation Example
```python
def generate_response(prompt, max_new_tokens=50):
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="np", max_length=64, truncation=True)
    input_ids = inputs["input_ids"]

    generated_tokens = []
    for _ in range(max_new_tokens):
        # Get model prediction
        outputs = session.run(None, {"input_ids": input_ids})
        logits = outputs[0]

        # Get next token (greedy)
        next_token_id = np.argmax(logits[0, -1, :])
        generated_tokens.append(next_token_id)

        # Stop on EOS
        if next_token_id == tokenizer.eos_token_id:
            break

        # Add to input for next iteration
        input_ids = np.concatenate([input_ids, [[next_token_id]]], axis=1)

    # Decode response
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response

# Example
response = generate_response("What is machine learning?")
print(f"Response: {response}")
```
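The helper above decodes greedily and re-runs the full sequence on every step (no KV cache), so generation cost grows with output length. If you want more varied output, the greedy `np.argmax` step can be swapped for temperature plus top-k sampling. The sketch below is one way to do that; it is not part of the original export:
```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Drop-in alternative to np.argmax for the last-position logits."""
    scaled = logits / temperature
    top_ids = np.argsort(scaled)[-top_k:]            # indices of the k most likely tokens
    probs = np.exp(scaled[top_ids] - scaled[top_ids].max())
    probs /= probs.sum()
    return int(np.random.choice(top_ids, p=probs))

# Inside the generation loop, replace the greedy step with:
# next_token_id = sample_next_token(logits[0, -1, :])
```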
## Testing Script
```python
#!/usr/bin/env python3
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np
def test_model():
    print("Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
    session = ort.InferenceSession("model.onnx")

    test_cases = [
        "Hello, how are you?",
        "What is the capital of France?",
        "Explain artificial intelligence in simple terms."
    ]

    for i, text in enumerate(test_cases, 1):
        print(f"\n{i}. Input: {text}")
        inputs = tokenizer(text, return_tensors="np", max_length=64,
                           truncation=True, padding="max_length")
        outputs = session.run(None, {"input_ids": inputs["input_ids"]})
        print(f"   Output shape: {outputs[0].shape}")

    print("\nAll tests passed!")

if __name__ == "__main__":
    test_model()
```
## Performance Expectations
- **Inference Speed**: 2-3x faster than CPU on Snapdragon NPUs
- **Memory Usage**: ~4GB RAM required
- **Tokens/Second**: 8-15 on Snapdragon 8cx Gen 2
- **Latency**: <100ms for short sequences
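These figures are indicative and depend on the device, the execution provider, and the sequence length. A rough way to measure latency on your own hardware is to time repeated forward passes; the sketch below reuses the files in this directory and times a fixed 64-token input:
```python
import time
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
session = ort.InferenceSession("model.onnx")

inputs = tokenizer("Benchmark prompt", return_tensors="np",
                   max_length=64, truncation=True, padding="max_length")
feed = {"input_ids": inputs["input_ids"]}

session.run(None, feed)                 # warm-up run (graph optimization, allocation)
runs = 10
start = time.perf_counter()
for _ in range(runs):
    session.run(None, feed)
elapsed = time.perf_counter() - start
print(f"Average latency: {elapsed / runs * 1000:.1f} ms per 64-token forward pass")
```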
## File Structure
```
model.onnx # Main ONNX model file
tokenizer.json # Tokenizer vocabulary
tokenizer_config.json # Tokenizer configuration
config.json # Model configuration
onnx__MatMul_* # External weight data files (129 files)
*.weight # Additional model weights
```
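Because `model.onnx` only stores the graph and points at the external weight files, a quick way to confirm nothing is missing is to read the external-data references without loading the weights. This sketch assumes the `onnx` package is installed (`pip install onnx`) and is run from the model directory:
```python
import os
import onnx

# Load only the graph definition, not the multi-GB external weights
model = onnx.load("model.onnx", load_external_data=False)

missing = set()
for tensor in model.graph.initializer:
    if tensor.data_location == onnx.TensorProto.EXTERNAL:
        for entry in tensor.external_data:
            # The "location" entry holds the referenced file name
            if entry.key == "location" and not os.path.exists(entry.value):
                missing.add(entry.value)

if missing:
    print("Missing external data files:", sorted(missing))
else:
    print("All referenced external data files are present.")
```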
## Important Notes
1. **All Files Required**: Keep all files in the same directory. The model.onnx file references external data files.
2. **Memory Requirements**: Ensure you have at least 4GB of available RAM.
3. **Qualcomm NPU Setup**: For optimal performance on Qualcomm hardware:
```python
# Use QNN execution provider (when available)
providers = ['QNNExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)
```
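Note that the QNN execution provider is only available in ONNX Runtime builds that include it (for example, the `onnxruntime-qnn` package on Windows on ARM); with the provider list above, ONNX Runtime silently falls back to CPU when QNN is unavailable. You can verify what you actually got:
```python
import onnxruntime as ort

# Providers compiled into the installed ONNX Runtime build
print(ort.get_available_providers())

providers = ['QNNExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)

# Providers this session actually ended up using (CPU fallback shows up here)
print(session.get_providers())
```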
## Deployment on Qualcomm Devices
### Windows on ARM
1. Copy all files to your device
2. Install ONNX Runtime: `pip install onnxruntime`
3. Run the test script to verify
### Android (with QNN SDK)
1. Use ONNX Runtime Mobile with QNN support
2. Package all files in your app bundle
3. Initialize with QNN execution provider
## Troubleshooting
**Model fails to load:**
- Ensure all files are in the same directory
- Check that you have sufficient RAM (4GB+)
**Slow inference:**
- Try enabling graph optimizations:
```python
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)
```
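If session creation itself is the slow part, the optimized graph can also be written to disk once so that later loads skip re-optimization (the output file name below is just an example):
```python
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"  # persist the optimized graph
session = ort.InferenceSession("model.onnx", sess_options)
```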
**Out of memory:**
- Reduce sequence length: `max_length=32`
- Process smaller batches
## License
This model inherits the license from microsoft/Phi-3.5-mini-instruct.
---
*Quantized and optimized for Qualcomm Snapdragon NPU deployment*