---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- quantized
- onnx
- clustering
model-index:
- name: sentence-transformers/all-MiniLM-L6-v2-quantized
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      type: semantic-similarity
      name: Semantic Similarity
    metrics:
    - type: similarity
      value: 0.95+
      name: Cosine Similarity (vs Original)
---
# Quantized SentenceTransformer: all-MiniLM-L6-v2

This is a quantized version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, optimized for production deployment.
## Model Details

- Base Model: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- Quantization: INT8 dynamic quantization using ONNX Runtime (see the sketch below)
- Size Reduction: ~75% smaller than the original model
- Performance: 95%+ cosine similarity to the original model's embeddings
- Format: ONNX
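The exact conversion script is not part of this repository, but dynamic INT8 quantization with ONNX Runtime typically looks like the following minimal sketch (file names taken from the Files section below; treat the rest as an illustrative assumption, not the exact command used):

```python
# Illustrative sketch: dynamic INT8 quantization of an FP32 ONNX export
# with ONNX Runtime's quantization tooling.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",         # FP32 ONNX export of all-MiniLM-L6-v2
    model_output="model-quant.onnx",  # INT8 quantized output
    weight_type=QuantType.QInt8,      # quantize weights to signed 8-bit integers
)
```

Dynamic quantization converts weights to INT8 ahead of time and quantizes activations on the fly at inference, so no calibration dataset is required.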
## Files

- `model-quant.onnx`: Quantized INT8 model (recommended for production)
- `model.onnx`: Original FP32 ONNX model
## Usage

### With ONNX Runtime (Recommended)
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load the quantized model
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

    # Run inference
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    })

    # Apply mean pooling
    last_hidden_state = outputs[0]
    attention_mask_expanded = np.expand_dims(inputs["attention_mask"], -1)
    attention_mask_expanded = np.broadcast_to(attention_mask_expanded, last_hidden_state.shape)
    masked_embeddings = last_hidden_state * attention_mask_expanded
    summed = np.sum(masked_embeddings, axis=1)
    summed_mask = np.sum(attention_mask_expanded, axis=1)
    embedding = summed / np.maximum(summed_mask, 1e-9)

    return embedding[0]

# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")
```
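Note that the original sentence-transformers pipeline for this model also L2-normalizes the pooled embedding. If you want outputs directly comparable to `SentenceTransformer.encode`, you can normalize the result of `encode_text` yourself (a small optional addition, not part of the snippet above):

```python
# Optional: L2-normalize the pooled embedding so cosine similarity reduces
# to a plain dot product, matching the original sentence-transformers pipeline.
embedding = encode_text("I love this product!")
embedding = embedding / np.linalg.norm(embedding)
```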
### With SentenceTransformers (Original)

For comparison with the original model:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```
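To check the similarity figure quoted above on your own inputs, you can compare the two embeddings directly. This minimal sketch reuses `encode_text` from the ONNX example and `model` from the snippet above:

```python
# Cosine similarity between the quantized ONNX embedding and the original
# SentenceTransformer embedding for the same input text.
import numpy as np

text = "I love this product!"
quantized_emb = encode_text(text)   # quantized ONNX model
original_emb = model.encode(text)   # original FP32 model

cosine = np.dot(quantized_emb, original_emb) / (
    np.linalg.norm(quantized_emb) * np.linalg.norm(original_emb)
)
print(f"Cosine similarity to original: {cosine:.4f}")
```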
## Performance Comparison

| Model     | Size  | Inference Speed (relative) | Memory Usage (relative) | Similarity to Original |
|-----------|-------|----------------------------|-------------------------|------------------------|
| Original  | ~90MB | 1.0x                       | 1.0x                    | 100%                   |
| Quantized | ~23MB | 1.2-1.5x                   | 0.6x                    | 95%+                   |
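Speed and memory figures depend heavily on hardware. If you want to reproduce relative latency numbers on your own machine, a rough micro-benchmark (not the exact benchmark behind the table) could look like this:

```python
# Rough latency comparison between the FP32 and INT8 ONNX models.
# Numbers vary by CPU; this only illustrates how to measure relative speed.
import time
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer("I love this product!", return_tensors="np", padding=True, truncation=True)
feed = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

for path in ("model.onnx", "model-quant.onnx"):
    session = ort.InferenceSession(path)
    session.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(100):
        session.run(None, feed)
    print(f"{path}: {(time.perf_counter() - start) / 100 * 1000:.2f} ms per run")
```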
## Use Cases

- Text Clustering: Group similar texts together
- Semantic Search: Find semantically similar documents (see the sketch below)
- Recommendation Systems: Content-based recommendations
- Duplicate Detection: Find near-duplicate texts
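As a concrete example of the semantic-search use case, here is a minimal sketch that reuses `encode_text` from the usage section; the corpus and query are purely illustrative:

```python
# Minimal semantic search: rank a small corpus against a query by cosine similarity.
import numpy as np

corpus = [
    "The battery lasts all day.",
    "Shipping took almost three weeks.",
    "Customer support resolved my issue quickly.",
]
query = "How long does delivery take?"

# Embed and L2-normalize so the dot product equals cosine similarity
corpus_emb = np.stack([encode_text(t) for t in corpus])
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

query_emb = encode_text(query)
query_emb /= np.linalg.norm(query_emb)

scores = corpus_emb @ query_emb
best = int(np.argmax(scores))
print(f"Best match ({scores[best]:.3f}): {corpus[best]}")
```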
## Technical Details

- Embedding Dimension: 384
- Max Sequence Length: 512 tokens
- Quantization Method: Dynamic INT8 quantization
- Framework: ONNX Runtime
## Citation

If you use this model, please cite the original work:
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```