---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- quantized
- onnx
- clustering
model-index:
- name: sentence-transformers/all-MiniLM-L6-v2-quantized
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity
    dataset:
      type: semantic-similarity
      name: Semantic Similarity
    metrics:
    - type: similarity
      value: 0.95+
      name: Cosine Similarity (vs Original)
---
# Quantized SentenceTransformer: all-MiniLM-L6-v2
This is a quantized version of the popular [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model, optimized for production deployment.
## Model Details
- **Base Model**: sentence-transformers/all-MiniLM-L6-v2
- **Quantization**: INT8 dynamic quantization using ONNX Runtime (see the sketch after this list)
- **Size Reduction**: ~75% smaller than the original model
- **Performance**: 95%+ cosine similarity to the original model's embeddings
- **Format**: ONNX
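For reference, this kind of INT8 file can be produced from the FP32 ONNX export with ONNX Runtime's dynamic quantization API. The sketch below uses the file names from this repository, but it is an illustration, not necessarily the exact command used to build it.
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic INT8 quantization: weights are stored as INT8, activations are
# quantized on the fly at inference time, so no calibration data is needed.
quantize_dynamic(
    model_input="model.onnx",         # FP32 ONNX export
    model_output="model-quant.onnx",  # quantized model written here
    weight_type=QuantType.QInt8,
)
```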
## Files
- `model-quant.onnx`: Quantized INT8 model (recommended for production)
- `model.onnx`: Original FP32 ONNX model
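Both files can be fetched from the Hub with `huggingface_hub`; in the sketch below, `repo_id` is a placeholder for this repository's actual id.
```python
from huggingface_hub import hf_hub_download

# Download the quantized ONNX file (cached locally by default)
model_path = hf_hub_download(repo_id="<this-repo-id>", filename="model-quant.onnx")
```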
## Usage
### With ONNX Runtime (Recommended)
```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer
# Load the quantized model
session = ort.InferenceSession("model-quant.onnx")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
def encode_text(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=512)
    # Run inference
    outputs = session.run(None, {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"]
    })
    # Apply mean pooling over the token embeddings, ignoring padding tokens
    last_hidden_state = outputs[0]
    attention_mask_expanded = np.expand_dims(inputs["attention_mask"], -1)
    attention_mask_expanded = np.broadcast_to(attention_mask_expanded, last_hidden_state.shape)
    masked_embeddings = last_hidden_state * attention_mask_expanded
    summed = np.sum(masked_embeddings, axis=1)
    summed_mask = np.sum(attention_mask_expanded, axis=1)
    embedding = summed / np.maximum(summed_mask, 1e-9)
    return embedding[0]
# Example usage
text = "I love this product!"
embedding = encode_text(text)
print(f"Embedding shape: {embedding.shape}")
```
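Because this is a sentence-similarity model, a common next step is comparing two embeddings with cosine similarity. The helper below simply reuses `encode_text` and `numpy` from the snippet above.
```python
def cosine_similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = encode_text("I love this product!")
emb_b = encode_text("This product is great!")
print(f"Similarity: {cosine_similarity(emb_a, emb_b):.4f}")
```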
### With SentenceTransformers (Original)
For comparison with the original model:
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("I love this product!")
```
## Performance Comparison
| Model | Size | Inference Speed (relative) | Memory Usage (relative) | Cosine Similarity to Original |
|-------|------|----------------------------|-------------------------|-------------------------------|
| Original (FP32) | ~90 MB | 1.0x | 1.0x | 100% |
| Quantized (INT8) | ~23 MB | 1.2-1.5x faster | ~0.6x | 95%+ |
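A quick way to sanity-check the similarity column yourself, assuming the `encode_text` helper from the Usage section is defined, is to compare both models on a handful of sentences:
```python
import numpy as np
from sentence_transformers import SentenceTransformer

original = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [
    "I love this product!",
    "The weather is nice today.",
    "Where is the nearest train station?",
]

scores = []
for text in texts:
    quant_emb = encode_text(text)      # quantized ONNX embedding
    orig_emb = original.encode(text)   # original FP32 embedding
    cos = np.dot(quant_emb, orig_emb) / (np.linalg.norm(quant_emb) * np.linalg.norm(orig_emb))
    scores.append(cos)

print(f"Mean cosine similarity to original: {np.mean(scores):.4f}")
```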
## Use Cases
- **Text Clustering**: Group similar texts together (see the sketch after this list)
- **Semantic Search**: Find semantically similar documents
- **Recommendation Systems**: Content-based recommendations
- **Duplicate Detection**: Find near-duplicate texts
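As a small illustration of the clustering use case, the sketch below groups a few sentences with scikit-learn's KMeans on embeddings from the `encode_text` helper above (scikit-learn is an assumed extra dependency, not something this model requires):
```python
import numpy as np
from sklearn.cluster import KMeans

texts = [
    "The battery lasts all day.",
    "Battery life is excellent.",
    "Shipping took two weeks.",
    "Delivery was very slow.",
]

# Encode each text with the quantized model, then cluster the embeddings
embeddings = np.stack([encode_text(t) for t in texts])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for label, text in zip(labels, texts):
    print(label, text)
```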
## Technical Details
- **Embedding Dimension**: 384
- **Max Sequence Length**: 512 tokens
- **Quantization Method**: Dynamic INT8 quantization
- **Framework**: ONNX Runtime
## Citation
If you use this model, please cite the original work:
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "http://arxiv.org/abs/1908.10084",
}
```