---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- quantized
- onnx
- clustering
model-index:
- name: sentence-transformers/all-MiniLM-L6-v2-quantized
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      type: semantic-similarity
      name: Semantic Similarity
    metrics:
    - type: similarity
      value: 0.95+
      name: Cosine Similarity (vs Original)
---

# Quantized SentenceTransformer: all-MiniLM-L6-v2

This is a quantized version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, optimized for production deployment.

## Model Details

- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Quantization**: INT8 dynamic quantization using ONNX Runtime
- **Size Reduction**: ~75% smaller than the original model
- **Performance**: 95%+ cosine similarity to the original model's embeddings
- **Format**: ONNX

## Files

- `model-quant.onnx`: Quantized INT8 model (recommended for production)
- `model.onnx`: Original FP32 ONNX model
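
To reproduce `model-quant.onnx` from the FP32 export, ONNX Runtime's dynamic quantization API is the standard route. A minimal sketch (file names match this repo; `QuantType.QInt8` as the weight type is an assumption, since the exact export settings are not documented here):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic INT8 quantization: weights are stored as INT8 and
# activations are quantized on the fly at inference time.
quantize_dynamic(
    model_input="model.onnx",         # original FP32 ONNX export
    model_output="model-quant.onnx",  # quantized output file
    weight_type=QuantType.QInt8,      # assumption: signed INT8 weights
)
```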

## Usage

### With ONNX Runtime (Recommended)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load the quantized model
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

    # Feed only the inputs the exported graph actually declares
    # (some exports also expect token_type_ids)
    input_names = {i.name for i in session.get_inputs()}
    feed = {k: v for k, v in inputs.items() if k in input_names}

    # Run inference; the first output holds the token-level hidden states
    last_hidden_state = session.run(None, feed)[0]

    # Mean pooling over tokens, weighted by the attention mask
    mask = np.expand_dims(inputs["attention_mask"], -1).astype(last_hidden_state.dtype)
    summed = np.sum(last_hidden_state * mask, axis=1)
    counts = np.maximum(mask.sum(axis=1), 1e-9)
    embedding = summed / counts

    # L2-normalize: the original sentence-transformers model normalizes
    # its output, so we do the same for parity
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

    return embedding[0]

# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")  # (384,)
```

### With SentenceTransformers (Original)

For comparison with the original model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```
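
To sanity-check the quantized model against the original, compare embeddings for the same input with cosine similarity. A short sketch, reusing `encode_text` from the Usage section (the 0.95 threshold mirrors the claim in this card):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

original = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

text = "I love this product!"
ref = original.encode(text)   # FP32 reference embedding
quant = encode_text(text)     # embedding from the quantized ONNX model

# Cosine similarity; both vectors are L2-normalized, so the dot
# product alone would do, but the full formula is kept for clarity.
cos = float(np.dot(ref, quant) / (np.linalg.norm(ref) * np.linalg.norm(quant)))
print(f"Cosine similarity: {cos:.4f}")  # expected to be above 0.95
```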

## Performance Comparison

| Model | Size | Inference Speed (relative) | Memory Usage (relative) | Cosine Similarity to Original |
|-------|------|----------------------------|-------------------------|-------------------------------|
| Original (FP32) | ~90 MB | 1.0x | 1.0x | 100% |
| Quantized (INT8) | ~23 MB | 1.2-1.5x faster | ~0.6x | 95%+ |

Speed and memory figures are relative to the FP32 model and vary with hardware, batch size, and sequence length.
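
To get numbers for your own hardware, a simple wall-clock benchmark is enough. A sketch, assuming `encode_text` and the `original` SentenceTransformer from the snippets above are in scope:

```python
import time

def bench(fn, text, warmup=5, iters=50):
    # Warm up caches and lazy initialization, then time the steady state.
    for _ in range(warmup):
        fn(text)
    start = time.perf_counter()
    for _ in range(iters):
        fn(text)
    return (time.perf_counter() - start) / iters

text = "I love this product!"
quant_t = bench(encode_text, text)
orig_t = bench(original.encode, text)
print(f"quantized: {quant_t * 1e3:.2f} ms/query")
print(f"original:  {orig_t * 1e3:.2f} ms/query")
print(f"speedup:   {orig_t / quant_t:.2f}x")
```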

## Use Cases

- **Text Clustering**: Group similar texts together (see the sketch after this list)
- **Semantic Search**: Find semantically similar documents
- **Recommendation Systems**: Content-based recommendations
- **Duplicate Detection**: Find near-duplicate texts
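
As an illustration of the clustering use case, here is a minimal sketch using scikit-learn's KMeans (scikit-learn is an assumed extra dependency; `encode_text` is the helper from the Usage section):

```python
import numpy as np
from sklearn.cluster import KMeans

texts = [
    "The battery life is great",
    "Battery lasts all day",
    "Shipping was very slow",
    "Delivery took two weeks",
]

# Embed each text with the quantized model.
embeddings = np.stack([encode_text(t) for t in texts])

# For L2-normalized embeddings, Euclidean k-means behaves much like
# clustering by cosine similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for text, label in zip(texts, labels):
    print(label, text)
```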

## Technical Details

- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Quantization Method**: Dynamic INT8 quantization
- **Framework**: ONNX Runtime

## Citation

If you use this model, please cite the original work:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```