ONNX Model Inference Example #6
opened by supreethrao
Hi,
It would be great if there were an example using the ONNX version of the model, given that the sentence-transformers version requires some transformation of the output of `model.encode()` to get the vectors.
Thanks!
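(For reference, the sentence-transformers path being asked about looks roughly like this. This is a minimal sketch assuming the usage shown on the model card, including its `trust_remote_code=True` loading and task prefixes; none of that is spelled out in this thread.)

```python
from sentence_transformers import SentenceTransformer

# Sketch of the sentence-transformers route. The "search_document: " task
# prefix follows the model card's recommended usage (an assumption here,
# not something stated in this discussion).
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
embeddings = model.encode(["search_document: text 1 ....", "search_document: text 2 ...."])
print(embeddings.shape)
```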
Can sentence-transformers be used with ONNX? I am not aware of that. If you want to use the ONNX model, you can use something like Triton, but the last time I tried it was a bit painful to set up: https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/ONNX/README.md
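If you do go the Triton route, the linked tutorial boils down to placing the exported ONNX file in a versioned model repository. A rough sketch of the expected layout, where the model name and directory path are illustrative assumptions rather than anything from this thread:

```
model_repository/
└── nomic-embed/               # hypothetical model name
    ├── config.pbtxt
    └── 1/
        └── model.onnx         # the onnx/model.onnx file from the HF repo
```

A minimal `config.pbtxt` for the ONNX Runtime backend might look like the following (the `max_batch_size` value is an arbitrary example); Triton can usually auto-complete the input/output sections from the ONNX graph itself:

```
name: "nomic-embed"
platform: "onnxruntime_onnx"
max_batch_size: 8
```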
zpn changed discussion status to closed
I was able to use the ONNX model with Huggingface's Optimum library, like so:
- Install all required dependencies for loading the model with Huggingface Transformers, e.g. `transformers`, `torch`, etc.
- Install Huggingface Optimum: `pip install optimum[onnxruntime-gpu]` (use this variant if you're running the model on a GPU).
- Install sentence-transformers: `pip install sentence-transformers`
- Load the tokenizer and model and perform inference, with mean pooling of the embeddings and normalization (skip those steps if you don't need them):
```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction
import torch
import torch.nn.functional as F

# nomic-embed-text-v1 uses the bert-base-uncased tokenizer, extended
# to an 8192-token context window
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    model_max_length=8192
)

model = ORTModelForFeatureExtraction.from_pretrained(
    "nomic-ai/nomic-embed-text-v1",
    file_name="onnx/model.onnx",
    provider="CUDAExecutionProvider",  # change this if you want to use a different backend
    trust_remote_code=True,
    rotary_scaling_factor=2
)

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding tokens
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

texts = ["text 1 ....", "text 2 ...."]
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=8192
)
inputs = inputs.to(torch.device("cuda"))

with torch.no_grad():
    model_output = model(**inputs)

embeddings = mean_pooling(model_output, inputs["attention_mask"])
normalized_embeddings = F.normalize(embeddings, p=2, dim=1).cpu().numpy().tolist()
```
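Since the embeddings are L2-normalized at the end, cosine similarity between two texts reduces to a plain dot product. A quick sanity check on the two example texts above (not part of the original post):

```python
import numpy as np

# The embeddings are unit-length after F.normalize, so the dot
# product of two rows is their cosine similarity.
emb = np.array(normalized_embeddings)
print(float(emb[0] @ emb[1]))
```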