---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- quantized
- onnx
- clustering
model-index:
- name: sentence-transformers/all-MiniLM-L6-v2-quantized
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      type: semantic-similarity
      name: Semantic Similarity
    metrics:
    - type: similarity
      value: 0.95+
      name: Cosine Similarity (vs Original)
---
# Quantized SentenceTransformer: all-MiniLM-L6-v2
This is a quantized version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, optimized for production deployment.
## Model Details
- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Quantization**: INT8 dynamic quantization using ONNX Runtime (see the sketch after this list)
- **Size Reduction**: ~75% smaller than the original model
- **Performance**: 95%+ cosine similarity to the original model's embeddings
- **Format**: ONNX
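The INT8 weights come from ONNX Runtime's dynamic quantization API. A minimal sketch of that conversion step, assuming a local FP32 export named `model.onnx` (the exact options used for this repo are not recorded here):
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",         # assumed FP32 ONNX export of all-MiniLM-L6-v2
    model_output="model-quant.onnx",  # INT8 model shipped in this repo
    weight_type=QuantType.QInt8,      # dynamic INT8 weight quantization
)
```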
## Files
- `model-quant.onnx`: Quantized INT8 model (recommended for production)
- `model.onnx`: Original FP32 ONNX model
## Usage
### With ONNX Runtime (Recommended)
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load the quantized model
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

    # Run inference
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    })

    # Apply mean pooling over token embeddings, ignoring padding positions
    last_hidden_state = outputs[0]
    attention_mask_expanded = np.expand_dims(inputs["attention_mask"], -1)
    attention_mask_expanded = np.broadcast_to(attention_mask_expanded, last_hidden_state.shape)
    masked_embeddings = last_hidden_state * attention_mask_expanded
    summed = np.sum(masked_embeddings, axis=1)
    summed_mask = np.sum(attention_mask_expanded, axis=1)
    embedding = summed / np.maximum(summed_mask, 1e-9)
    return embedding[0]

# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")  # (384,)
```
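Note that the original sentence-transformers pipeline for this model also L2-normalizes the pooled embedding. Cosine similarity is unaffected either way, but if you need drop-in parity for dot-product scoring, normalize the result yourself:
```python
import numpy as np

embedding = encode_text("I love this product!")
embedding /= np.linalg.norm(embedding)  # match the original pipeline's normalization step
```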
### With SentenceTransformers (Original)
For comparison with the original model:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```
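To check the quality claim yourself, compare the two embeddings directly. A minimal sketch, reusing `encode_text` from the ONNX Runtime example above:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

original = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

text = "I love this product!"
a = encode_text(text)       # quantized ONNX embedding
b = original.encode(text)   # original FP32 embedding

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity (quantized vs original): {cosine:.4f}")
```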
## Performance Comparison
| Model | Size | Inference Speed (relative) | Memory Usage (relative) | Cosine Similarity to Original |
|-------|------|----------------------------|-------------------------|-------------------------------|
| Original (FP32) | ~90 MB | 1.0x | 1.0x | 100% |
| Quantized (INT8) | ~23 MB | 1.2-1.5x faster | ~0.6x | 95%+ |
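The speed numbers above depend heavily on hardware and batch size. A rough way to reproduce the comparison on your own machine (the batch size, run count, and file names here are illustrative assumptions):
```python
import time
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer(["I love this product!"] * 32, return_tensors="np",
                   padding=True, truncation=True, max_length=512)
feed = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

for name in ("model.onnx", "model-quant.onnx"):
    session = ort.InferenceSession(name)
    session.run(None, feed)  # warm-up run before timing
    start = time.perf_counter()
    for _ in range(20):
        session.run(None, feed)
    print(f"{name}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms/batch")
```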
## Use Cases
- **Text Clustering**: Group similar texts together
- **Semantic Search**: Find semantically similar documents (see the sketch after this list)
- **Recommendation Systems**: Content-based recommendations
- **Duplicate Detection**: Find near-duplicate texts
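As a sketch of the semantic-search use case, assuming `encode_text` from the usage section above and a small in-memory corpus:
```python
import numpy as np

corpus = [
    "The battery lasts all day.",
    "Shipping was very slow.",
    "Great sound quality for the price.",
]
corpus_emb = np.stack([encode_text(t) for t in corpus])
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

query = encode_text("How is the audio?")
query /= np.linalg.norm(query)

scores = corpus_emb @ query  # cosine similarities against the whole corpus
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {corpus[i]}")
```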
## Technical Details
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Quantization Method**: Dynamic INT8 quantization
- **Framework**: ONNX Runtime
## Citation
If you use this model, please cite the original work:
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```