---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - quantized
  - onnx
  - clustering
model-index:
  - name: sentence-transformers/all-MiniLM-L6-v2-quantized
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          type: semantic-similarity
          name: Semantic Similarity
        metrics:
          - type: similarity
            value: 0.95+
            name: Cosine Similarity (vs Original)
---

# Quantized SentenceTransformer: all-MiniLM-L6-v2

This is a quantized version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, optimized for production deployment.

## Model Details

- **Base Model:** [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- **Quantization:** INT8 dynamic quantization using ONNX Runtime (see the sketch after this list)
- **Size Reduction:** ~75% smaller than the original model
- **Performance:** 95%+ cosine similarity to the original model's embeddings
- **Format:** ONNX
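
The quantized file can be reproduced from the FP32 export with ONNX Runtime's dynamic quantizer. A minimal sketch (file names assumed to match this repo):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert the FP32 weights to INT8 on disk. Dynamic quantization
# quantizes activations at runtime, so no calibration data is needed.
quantize_dynamic(
    model_input="model.onnx",        # FP32 ONNX export
    model_output="model-quant.onnx", # INT8 result
    weight_type=QuantType.QInt8,
)
```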

## Files

- `model-quant.onnx`: quantized INT8 model, recommended for production (its expected inputs can be checked as shown below)
- `model.onnx`: original FP32 ONNX model
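
ONNX exports of BERT-style models differ in whether they also expect a `token_type_ids` input. A quick sanity check of what this export declares (not from the original card, just onnxruntime's standard introspection API):

```python
import onnxruntime as ort

session = ort.InferenceSession("model-quant.onnx")

# Print the graph's declared inputs and outputs
for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```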

## Usage

### With ONNX Runtime (Recommended)

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

# Load the quantized model and the original tokenizer
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)

    # Feed exactly the inputs the graph declares (some exports also
    # expect token_type_ids, which the tokenizer already provides)
    ort_inputs = {inp.name: inputs[inp.name] for inp in session.get_inputs()}
    outputs = session.run(None, ort_inputs)

    # Mean pooling over token embeddings, ignoring padding tokens
    last_hidden_state = outputs[0]
    mask = np.expand_dims(inputs["attention_mask"], -1).astype(last_hidden_state.dtype)

    summed = np.sum(last_hidden_state * mask, axis=1)
    counts = np.maximum(mask.sum(axis=1), 1e-9)
    embedding = summed / counts

    return embedding[0]

# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")
```

### With SentenceTransformers (Original)

For comparison with the original model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```
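
To spot-check the "95%+ similarity" figure on your own inputs, you can compare both models' outputs directly. A sketch assuming `encode_text` and `model` from the snippets above; note that SentenceTransformer L2-normalizes this model's embeddings by default, so normalize the ONNX output before comparing:

```python
import numpy as np

text = "I love this product!"

onnx_emb = encode_text(text)
onnx_emb = onnx_emb / np.linalg.norm(onnx_emb)  # match SentenceTransformer's normalization
st_emb = model.encode(text)                     # already unit-normalized

print("Cosine similarity:", float(np.dot(onnx_emb, st_emb)))
```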

## Performance Comparison

| Model     | Size   | Inference Speed (relative) | Memory Usage (relative) | Similarity to Original |
|-----------|--------|----------------------------|-------------------------|------------------------|
| Original  | ~90 MB | 1.0x                       | 1.0x                    | 100%                   |
| Quantized | ~23 MB | 1.2-1.5x                   | 0.6x                    | 95%+                   |
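
The speed figures depend on hardware, so they are worth reproducing locally. A rough micro-benchmark sketch (reuses `tokenizer` from the usage snippet; not the original measurement method):

```python
import time
import onnxruntime as ort

def mean_latency(model_path, text="I love this product!", runs=100):
    sess = ort.InferenceSession(model_path)
    encoded = tokenizer(text, return_tensors="np", padding=True,
                        truncation=True, max_length=512)
    feed = {inp.name: encoded[inp.name] for inp in sess.get_inputs()}
    sess.run(None, feed)  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs

print(f"FP32: {mean_latency('model.onnx') * 1000:.2f} ms")
print(f"INT8: {mean_latency('model-quant.onnx') * 1000:.2f} ms")
```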

## Use Cases

- **Text Clustering:** group similar texts together (see the sketch after this list)
- **Semantic Search:** find semantically similar documents
- **Recommendation Systems:** content-based recommendations
- **Duplicate Detection:** find near-duplicate texts
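
For the clustering use case, the embeddings plug straight into standard algorithms. A sketch assuming scikit-learn is installed and `encode_text` from the usage snippet (texts and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

texts = [
    "I love this product!",
    "This item works great.",
    "The package arrived damaged.",
    "My order was broken on delivery.",
]
embeddings = np.stack([encode_text(t) for t in texts])

# Two clusters: positive reviews vs. delivery complaints
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for text, label in zip(texts, labels):
    print(label, text)
```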

## Technical Details

- **Embedding Dimension:** 384
- **Max Sequence Length:** 512 tokens
- **Quantization Method:** dynamic INT8 quantization
- **Framework:** ONNX Runtime

## Citation

If you use this model, please cite the original work:

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```