---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- quantized
- onnx
- clustering
model-index:
- name: sentence-transformers/all-MiniLM-L6-v2-quantized
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      type: semantic-similarity
      name: Semantic Similarity
    metrics:
    - type: similarity
      value: 0.95+
      name: Cosine Similarity (vs Original)
---

# Quantized SentenceTransformer: all-MiniLM-L6-v2

This is a quantized version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, optimized for production deployment.

## Model Details

- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Quantization**: INT8 dynamic quantization using ONNX Runtime
- **Size Reduction**: ~75% smaller than the original model
- **Performance**: 95%+ cosine similarity to the original model's embeddings
- **Format**: ONNX

## Files

- `model-quant.onnx`: Quantized INT8 model (recommended for production)
- `model.onnx`: Original FP32 ONNX model

## Usage

### With ONNX Runtime (Recommended)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load the quantized model and the matching tokenizer
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

    # Run inference
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    })

    # Apply mean pooling over the token embeddings, ignoring padding
    last_hidden_state = outputs[0]
    attention_mask_expanded = np.expand_dims(inputs["attention_mask"], -1)
    attention_mask_expanded = np.broadcast_to(attention_mask_expanded, last_hidden_state.shape)
    masked_embeddings = last_hidden_state * attention_mask_expanded
    summed = np.sum(masked_embeddings, axis=1)
    summed_mask = np.sum(attention_mask_expanded, axis=1)
    embedding = summed / np.maximum(summed_mask, 1e-9)
    return embedding[0]

# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")
```
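Embeddings produced by `encode_text` can be compared directly with cosine similarity, which is the basis for the clustering, search, and duplicate-detection use cases listed below. A minimal sketch reusing `encode_text` and `np` from the snippet above (the example sentences are placeholders):

```python
def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = encode_text("I love this product!")
emb_b = encode_text("This product is great.")
print(f"Similarity: {cosine_similarity(emb_a, emb_b):.3f}")
```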
### With SentenceTransformers (Original)

For comparison with the original model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```

## Performance Comparison

| Model     | Size  | Inference Speed | Memory Usage | Similarity to Original |
|-----------|-------|-----------------|--------------|------------------------|
| Original  | ~90MB | 1.0x (baseline) | 1.0x         | 100%                   |
| Quantized | ~23MB | 1.2-1.5x faster | ~0.6x        | 95%+                   |

## Use Cases

- **Text Clustering**: Group similar texts together
- **Semantic Search**: Find semantically similar documents
- **Recommendation Systems**: Content-based recommendations
- **Duplicate Detection**: Find near-duplicate texts

## Technical Details

- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Quantization Method**: Dynamic INT8 quantization
- **Framework**: ONNX Runtime

## Citation

If you use this model, please cite the original work:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```
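As a quick sanity check, the similarity figure from the comparison table can be reproduced on your own texts by encoding the same sentences with both the quantized ONNX model and the original model. A rough sketch combining the two usage snippets above (it assumes `model-quant.onnx` is in the working directory and reuses `encode_text`; the sentences are placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# FP32 reference model
original = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "I love this product!",
    "The weather is nice today.",
    "Quantization reduces model size with a small accuracy cost.",
]

scores = []
for sentence in sentences:
    quantized_emb = encode_text(sentence)      # from the ONNX Runtime snippet above
    original_emb = original.encode(sentence)   # reference embedding
    cos = np.dot(quantized_emb, original_emb) / (
        np.linalg.norm(quantized_emb) * np.linalg.norm(original_emb)
    )
    scores.append(cos)

print(f"Mean cosine similarity vs original: {np.mean(scores):.3f}")
```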