|
--- |
|
library_name: sentence-transformers |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- quantized |
|
- onnx |
|
- clustering |
|
model-index:
- name: sentence-transformers/all-MiniLM-L6-v2-quantized
  results:
  - task:
      type: sentence-similarity
      name: Semantic Similarity
    dataset:
      type: semantic-similarity
      name: Semantic Similarity
    metrics:
    - type: similarity
      value: 0.95
      name: Cosine Similarity (vs Original, lower bound)
|
--- |
|
|
|
# Quantized SentenceTransformer: all-MiniLM-L6-v2 |
|
|
|
This is a quantized version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, optimized for production deployment. |
|
|
|
## Model Details |
|
|
|
- **Base Model**: sentence-transformers/all-MiniLM-L6-v2 |
|
- **Quantization**: INT8 dynamic quantization using ONNX Runtime (see the sketch after this list)
|
- **Size Reduction**: ~75% smaller than the original model |
|
- **Performance**: 95%+ cosine similarity to the original model's embeddings
|
- **Format**: ONNX |
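
For reference, dynamic INT8 quantization of an ONNX model is typically produced with ONNX Runtime's quantization tooling. The snippet below is a minimal sketch of that process using the file names shipped in this repository; the exact options used to produce the released `model-quant.onnx` may differ.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Minimal sketch: convert the FP32 ONNX export to dynamically quantized INT8 weights.
# The released model may have been produced with additional options
# (e.g. per-channel quantization or operator exclusions).
quantize_dynamic(
    model_input="model.onnx",
    model_output="model-quant.onnx",
    weight_type=QuantType.QInt8,
)
```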
|
|
|
## Files |
|
|
|
- `model-quant.onnx`: Quantized INT8 model (recommended for production) |
|
- `model.onnx`: Original FP32 ONNX model |
|
|
|
## Usage |
|
|
|
### With ONNX Runtime (Recommended) |
|
|
|
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load the quantized model and the original tokenizer
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

    # Run inference
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    })

    # Apply mean pooling over token embeddings, ignoring padding tokens
    last_hidden_state = outputs[0]
    attention_mask_expanded = np.expand_dims(inputs["attention_mask"], -1)
    attention_mask_expanded = np.broadcast_to(attention_mask_expanded, last_hidden_state.shape)

    masked_embeddings = last_hidden_state * attention_mask_expanded
    summed = np.sum(masked_embeddings, axis=1)
    summed_mask = np.sum(attention_mask_expanded, axis=1)
    embedding = summed / np.maximum(summed_mask, 1e-9)

    # Note: unlike the original SentenceTransformers pipeline, this does not
    # L2-normalize the embedding; normalize it if you need unit-length vectors.
    return embedding[0]

# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")
```
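
Since the embeddings are intended for similarity tasks, a typical next step is to compare two encoded texts with cosine similarity. A minimal sketch, assuming the `encode_text` helper defined above is in scope:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# encode_text is the helper defined in the example above.
emb_a = encode_text("I love this product!")
emb_b = encode_text("This product is great!")
print(f"Cosine similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```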
|
|
|
### With SentenceTransformers (Original) |
|
|
|
For comparison with the original model: |
|
|
|
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```
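
To check the reported agreement on your own inputs, one option is to encode the same sentence with both pipelines and compare the results. A sketch, assuming the `encode_text` helper from the ONNX example above is in scope:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

original = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

text = "I love this product!"
emb_original = original.encode(text)   # FP32 SentenceTransformer embedding
emb_quantized = encode_text(text)      # INT8 ONNX embedding (helper defined above)

# Cosine similarity between the two embeddings; values close to 1.0 indicate
# that quantization preserved the embedding geometry for this input.
cos = np.dot(emb_original, emb_quantized) / (
    np.linalg.norm(emb_original) * np.linalg.norm(emb_quantized)
)
print(f"Cosine similarity to original: {cos:.4f}")
```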
|
|
|
## Performance Comparison |
|
|
|
| Model | Size | Inference Speed | Memory Usage | Similarity to Original |
|-------|------|-----------------|--------------|------------------------|
| Original (FP32) | ~90 MB | 1.0x (baseline) | 1.0x (baseline) | 100% |
| Quantized (INT8) | ~23 MB | 1.2–1.5x faster | ~0.6x | 95%+ |
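
The size, speed, and memory figures above depend on hardware, batch size, and sequence length. A rough way to reproduce the speed comparison on your own machine is to time both ONNX files on the same tokenized batch; the sketch below is illustrative only, not the benchmark used for this table.

```python
import time
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer(["I love this product!"] * 32, return_tensors="np",
                   padding=True, truncation=True, max_length=512)
feed = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

for path in ("model.onnx", "model-quant.onnx"):
    session = ort.InferenceSession(path)
    session.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(20):
        session.run(None, feed)
    print(f"{path}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per batch")
```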
|
|
|
## Use Cases |
|
|
|
- **Text Clustering**: Group similar texts together (see the sketch after this list)
|
- **Semantic Search**: Find semantically similar documents |
|
- **Recommendation Systems**: Content-based recommendations |
|
- **Duplicate Detection**: Find near-duplicate texts |
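
As an illustration of the clustering use case, the embeddings can be fed to any standard clustering algorithm. A minimal sketch with scikit-learn's `KMeans` (scikit-learn is not a dependency of this model, and `encode_text` is the helper defined in the Usage section):

```python
import numpy as np
from sklearn.cluster import KMeans

texts = [
    "The battery lasts all day.",
    "Battery life is excellent.",
    "Shipping took two weeks.",
    "Delivery was very slow.",
]

# Encode each text with the quantized model (encode_text is defined in the Usage section).
embeddings = np.vstack([encode_text(t) for t in texts])

# Cluster into two groups; texts about the same topic should land together.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for text, label in zip(texts, labels):
    print(label, text)
```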
|
|
|
## Technical Details |
|
|
|
- **Embedding Dimension**: 384 |
|
- **Max Sequence Length**: 512 tokens |
|
- **Quantization Method**: Dynamic INT8 quantization |
|
- **Framework**: ONNX Runtime |
|
|
|
## Citation |
|
|
|
If you use this model, please cite the original work: |
|
|
|
```bibtex |
|
@inproceedings{reimers-2019-sentence-bert, |
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
month = "11", |
|
year = "2019", |
|
publisher = "Association for Computational Linguistics", |
|
url = "http://arxiv.org/abs/1908.10084", |
|
} |
|
``` |
|
|