---
library_name: optimum.onnxruntime
tags:
- onnx
- int8
- quantization
- embeddings
- cpu
pipeline_tag: feature-extraction
license: apache-2.0
base_model: ibm-granite/granite-embedding-english-r2
---

# Granite Embedding English R2 — INT8 (ONNX)

This is the **INT8-quantized ONNX version** of [`ibm-granite/granite-embedding-english-r2`](https://huggingface.co/ibm-granite/granite-embedding-english-r2).
It is optimized to run efficiently on **CPU** using [🤗 Optimum](https://huggingface.co/docs/optimum) with ONNX Runtime.

- **Embedding dimension:** 768
- **Precision:** INT8 (dynamic quantization)
- **Backend:** ONNX Runtime
- **Use case:** text embeddings, semantic search, clustering, retrieval

---

## 📥 Installation

```bash
pip install -U transformers optimum[onnxruntime]
```

---

## 🚀 Usage

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

repo_id = "yasserrmd/granite-embedding-r2-onnx"

# Load tokenizer + ONNX model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForFeatureExtraction.from_pretrained(repo_id)

# Encode sentences
inputs = tokenizer(["Hello world", "مرحباً"], padding=True, return_tensors="pt")
outputs = model(**inputs)

# Mean pooling over tokens, ignoring padding via the attention mask
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)  # torch.Size([2, 768])
```

---

## ✅ Notes

* Dynamic INT8 quantization reduces model size and speeds up CPU inference, typically with only a small loss in embedding quality.
* The pooling strategy above is **masked mean pooling**; you can switch to CLS pooling or max pooling if that better suits your task.
* Works seamlessly with the **Hugging Face Hub** and `optimum.onnxruntime`.

---

## 📚 References

* [Original Granite Embedding English R2](https://huggingface.co/ibm-granite/granite-embedding-english-r2)
* [Optimum ONNX Runtime docs](https://huggingface.co/docs/optimum/onnxruntime/usage_guides/models)
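
---

## 🛠️ Reproducing the INT8 export (sketch)

The exact export command used for this repository is not documented here, but an ONNX export followed by dynamic INT8 quantization can be reproduced with Optimum's standard tooling. The snippet below is a minimal sketch under that assumption: the output directory names are illustrative, and `avx512_vnni` assumes a modern x86 CPU (on ARM you would pick `AutoQuantizationConfig.arm64(...)` instead).

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# 1) Export the original FP32 checkpoint to ONNX
model = ORTModelForFeatureExtraction.from_pretrained(
    "ibm-granite/granite-embedding-english-r2", export=True
)
model.save_pretrained("granite-embedding-r2-onnx-fp32")  # illustrative path

# 2) Apply dynamic INT8 quantization (activations are quantized on the fly at runtime)
quantizer = ORTQuantizer.from_pretrained("granite-embedding-r2-onnx-fp32")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(
    save_dir="granite-embedding-r2-onnx-int8",  # illustrative path
    quantization_config=qconfig,
)
```

Dynamic quantization needs no calibration dataset, which makes it the simplest option for CPU-targeted embedding models; static quantization can yield further speedups but requires calibration data.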