Phi-3.5 Mini Instruct - Quantized ONNX Model (Consolidated)

πŸš€ Model Overview

This is Microsoft's Phi-3.5-mini-instruct model, quantized to INT8 and optimized for Qualcomm Snapdragon NPUs. This version consolidates all files into a single directory to simplify deployment.

πŸ“Š Model Specifications

  • Base Model: microsoft/Phi-3.5-mini-instruct
  • Size: 7292.4 MB (quantized from 7.3GB original)
  • Compression: 50% size reduction
  • Format: ONNX INT8 quantized with external data
  • Files: 203 files total
  • Target: Qualcomm Snapdragon NPUs

πŸ”§ Quick Start

Installation

pip install onnxruntime transformers numpy

Basic Usage

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)

# Load ONNX model
session = ort.InferenceSession("model.onnx")

# Prepare input
text = "Hello, what is artificial intelligence?"
inputs = tokenizer(text, return_tensors="np", max_length=64, truncation=True, padding="max_length")

# Run inference
outputs = session.run(None, {"input_ids": inputs["input_ids"]})
logits = outputs[0]

print(f"Input: {text}")
print(f"Output shape: {logits.shape}")

Text Generation Example

def generate_response(prompt, max_new_tokens=50):
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="np", max_length=64, truncation=True)
    input_ids = inputs["input_ids"]
    
    generated_tokens = []
    
    for _ in range(max_new_tokens):
        # Get model prediction
        outputs = session.run(None, {"input_ids": input_ids})
        logits = outputs[0]
        
        # Get next token (greedy)
        next_token_id = np.argmax(logits[0, -1, :])
        generated_tokens.append(next_token_id)
        
        # Stop on EOS
        if next_token_id == tokenizer.eos_token_id:
            break
        
        # Add to input for next iteration
        input_ids = np.concatenate([input_ids, [[next_token_id]]], axis=1)
    
    # Decode response
    response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
    return response

# Example
response = generate_response("What is machine learning?")
print(f"Response: {response}")

πŸ§ͺ Testing Script

#!/usr/bin/env python3
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

def test_model():
    print("πŸ”„ Loading model...")
    tokenizer = AutoTokenizer.from_pretrained(".", trust_remote_code=True)
    session = ort.InferenceSession("model.onnx")
    
    test_cases = [
        "Hello, how are you?",
        "What is the capital of France?",
        "Explain artificial intelligence in simple terms."
    ]
    
    for i, text in enumerate(test_cases, 1):
        print(f"\n{i}. Input: {text}")
        
        inputs = tokenizer(text, return_tensors="np", max_length=64, 
                          truncation=True, padding="max_length")
        outputs = session.run(None, {"input_ids": inputs["input_ids"]})
        
        print(f"   βœ… Output shape: {outputs[0].shape}")
    
    print("\nπŸŽ‰ All tests passed!")

if __name__ == "__main__":
    test_model()
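Save the script as test_model.py in the model directory and run it from there:

python test_model.py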

⚑ Performance Expectations

  • Inference Speed: 2-3x faster than CPU on Snapdragon NPUs
  • Memory Usage: ~4GB RAM required
  • Tokens/Second: 8-15 on Snapdragon 8cx Gen 2
  • Latency: <100 ms for short sequences (a rough measurement sketch follows this list)
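These figures depend on hardware, sequence length, and the execution provider in use. A rough way to measure throughput with the greedy loop from the generation example above (results will differ on your device):

import time

# Rough throughput estimate; generation may stop early on EOS,
# so treat the number as approximate
n_tokens = 32
start = time.perf_counter()
generate_response("Explain quantization in one sentence.", max_new_tokens=n_tokens)
elapsed = time.perf_counter() - start
print(f"~{n_tokens / elapsed:.1f} tokens/second ({elapsed / n_tokens * 1000:.0f} ms/token)")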

πŸ“ File Structure

model.onnx              # Main ONNX model file
tokenizer.json          # Tokenizer vocabulary
tokenizer_config.json   # Tokenizer configuration
config.json             # Model configuration
onnx__MatMul_*         # External weight data files (129 files)
*.weight               # Additional model weights

⚠️ Important Notes

  1. All Files Required: Keep all files in the same directory. The model.onnx file references external data files.

  2. Memory Requirements: Ensure you have at least 4GB of available RAM.

  3. Qualcomm NPU Setup: For optimal performance on Qualcomm hardware:

# Use QNN execution provider (when available)
providers = ['QNNExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession("model.onnx", providers=providers)
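Note that the stock onnxruntime wheel is CPU-only; QNNExecutionProvider is only present in builds that include the QNN execution provider. You can check what your installation offers before creating the session:

# QNN appears here only if your onnxruntime build includes it;
# otherwise the session falls back to CPU
print(ort.get_available_providers())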

πŸš€ Deployment on Qualcomm Devices

Windows on ARM

  1. Copy all files to your device
  2. Install ONNX Runtime: pip install onnxruntime
  3. Run the test script above to verify the setup

Android (with QNN SDK)

  1. Use ONNX Runtime Mobile with QNN support
  2. Package all files in your app bundle
  3. Initialize with the QNN execution provider (see the sketch below)
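The exact initialization depends on your QNN SDK and onnxruntime build. A hedged sketch for step 3, passing per-provider options (the backend_path value is an assumption and varies by platform and SDK install):

# Hypothetical QNN setup; adjust backend_path to your QNN SDK
# (e.g. libQnnHtp.so on Android, QnnHtp.dll on Windows on ARM)
qnn_options = {"backend_path": "libQnnHtp.so"}
session = ort.InferenceSession(
    "model.onnx",
    providers=[("QNNExecutionProvider", qnn_options), "CPUExecutionProvider"],
)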

πŸ› Troubleshooting

Model fails to load:

  • Ensure all files are in the same directory (a quick check is sketched below)
  • Check that you have sufficient RAM (4GB+)
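A quick way to confirm the external data files are present next to model.onnx (file patterns taken from the structure listed above):

import glob, os

assert os.path.exists("model.onnx"), "model.onnx not found in the current directory"
external = glob.glob("onnx__MatMul_*") + glob.glob("*.weight")
print(f"Found {len(external)} external data files")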

Slow inference:

  • Try enabling graph optimizations:
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", sess_options)

Out of memory:

  • Reduce sequence length: max_length=32
  • Process smaller batches

πŸ“„ License

This model inherits the license from microsoft/Phi-3.5-mini-instruct.


Quantized and optimized for Qualcomm Snapdragon NPU deployment
