# BGE-M3 ONNX for AWS Graviton4
This is an ONNX export of BAAI/bge-m3, optimized specifically for AWS Graviton4 processors.
## Model Description
BGE-M3 is a versatile embedding model from the FlagEmbedding project that supports:
- Dense retrieval: Traditional sentence embeddings
- Sparse retrieval: Lexical matching with learnable sparse weights
- Multi-vector retrieval: ColBERT-style token-level embeddings
- Multi-lingual support: Over 100 languages
## Optimization Details
- ONNX Opset: 17
- Optimization Level: O3 (GELU approximation enabled for BGE-M3)
- Target Hardware: AWS Graviton4 (ARM64 with bfloat16 support)
- Model Size: ~2.2GB
- Quantized Version Available: INT8 quantized model in the `quantized/` subdirectory
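For reference, an O3-optimized graph like this one can be produced with optimum's `ORTOptimizer`. The snippet below is only a minimal sketch of that flow, not the exact script used for this repository; the local input directory `bge-m3-onnx/` (an unoptimized ONNX export of BGE-M3) and the output directory are illustrative.

```python
from optimum.onnxruntime import ORTModelForCustomTasks, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Hypothetical local directory holding an unoptimized ONNX export of BGE-M3.
model = ORTModelForCustomTasks.from_pretrained("bge-m3-onnx")

optimizer = ORTOptimizer.from_pretrained(model)

# O3 applies the most aggressive graph fusions, including the GELU
# approximation mentioned in the optimization details above.
optimization_config = AutoOptimizationConfig.O3()

optimizer.optimize(
    save_dir="bge-m3-onnx-o3",
    optimization_config=optimization_config,
    use_external_data_format=True,  # weights are >2 GB, keep them external
)
```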
## Usage

### Quick Start
```python
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Load from Hugging Face Hub
model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    file_name="model_optimized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")

# Tokenize and get embeddings
inputs = tokenizer("Your text here", return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)

# Access different embedding types
dense_embeddings = outputs["dense_vecs"]      # Shape: (batch_size, 1024)
sparse_embeddings = outputs["sparse_vecs"]    # Shape: (batch_size, seq_len, 1)
colbert_embeddings = outputs["colbert_vecs"]  # Shape: (batch_size, seq_len, 1024)
```
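The outputs can be turned into relevance scores in the usual BGE-M3 fashion: cosine similarity for the dense vectors and a ColBERT-style max-sim sum for the token-level vectors (the sparse weights, which map token ids to lexical weights, are omitted here for brevity). The snippet below is a rough numpy sketch reusing the model and tokenizer loaded above, not the official FlagEmbedding scoring code; the example texts, combination weights, and the handling of padding positions are illustrative only.

```python
import numpy as np

texts = ["What is BGE-M3?", "BGE-M3 is a multilingual embedding model."]
inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)

dense = outputs["dense_vecs"]      # (2, 1024)
colbert = outputs["colbert_vecs"]  # (2, seq_len, 1024)

# Dense score: cosine similarity between the two pooled embeddings.
q, d = dense[0], dense[1]
dense_score = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

# ColBERT-style late interaction: for each query token, keep its best-matching
# document token, then average over query tokens. Padding positions are not
# masked out here to keep the sketch short.
sim = colbert[0] @ colbert[1].T
colbert_score = float(sim.max(axis=1).mean())

# Illustrative weighting of the two signals; real pipelines tune these weights.
score = 0.7 * dense_score + 0.3 * colbert_score
print(dense_score, colbert_score, score)
```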
### AWS Graviton4 with bfloat16 Acceleration
```python
import onnxruntime as ort
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Enable bfloat16 acceleration for Graviton4
sess_options = ort.SessionOptions()
sess_options.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    file_name="model_optimized.onnx",
    session_options=sess_options
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")
```
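A quick way to sanity-check the setting on your instance is to time a few batches with the model loaded above (and optionally repeat without the fastmath flag for comparison). This is only a rough sketch; the batch contents and iteration count are arbitrary.

```python
import time

texts = ["AWS Graviton4 supports bfloat16 fastmath GEMM kernels."] * 8
inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)

# Warm-up run, then average a handful of forward passes.
model.forward(**inputs)
start = time.perf_counter()
for _ in range(10):
    model.forward(**inputs)
elapsed = (time.perf_counter() - start) / 10
print(f"avg latency per batch: {elapsed * 1000:.1f} ms")
```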
### Using the Quantized Model
The INT8 quantized model provides significant memory savings and faster inference with minimal quality loss:
```python
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Load quantized model
model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    subfolder="quantized",
    file_name="model_optimized_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")

# Usage is identical to the standard model
inputs = tokenizer("Your text here", return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)
```
### Quantization Details
- Quantization Type: Static INT8 quantization
- Calibration Dataset: GLUE SST-2 (300 samples)
- Per-channel: Enabled for better accuracy
- Quality: 99.98% similarity with original model on diverse test cases
- Performance: ~2-4x faster inference on Graviton4 processors
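A quantized model along these lines can be produced with optimum's `ORTQuantizer`. The snippet below is a sketch of a typical static INT8 flow under the settings listed above (ARM64 target, per-channel weights, SST-2 calibration); the preprocessing function, max length, and output directory are illustrative, and it is not necessarily the exact script used for this repository.

```python
from functools import partial

from optimum.onnxruntime import ORTModelForCustomTasks, ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig
from transformers import AutoTokenizer

model_id = "idomeneo/bge-m3-onnx-graviton4"
model = ORTModelForCustomTasks.from_pretrained(model_id, file_name="model_optimized.onnx")
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantizer = ORTQuantizer.from_pretrained(model)

# Static INT8 configuration targeting ARM64, with per-channel weights.
qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=True)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True, max_length=128)

# Calibrate activation ranges on GLUE SST-2 (300 samples), as noted above.
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=AutoCalibrationConfig.minmax(calibration_dataset),
)

quantizer.quantize(
    save_dir="quantized",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
    use_external_data_format=True,  # weights are stored in an external .data file
)
```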
### Quantization Quality Results
Comprehensive testing across 20 diverse examples shows excellent quality retention:
| Test Category | Examples | Avg Similarity |
|---|---|---|
| English Technical | Machine learning, neural networks, NLP | 99.98% |
| English General | Common phrases, news topics | 99.97% |
| Multilingual | Chinese, Spanish, French, German, Japanese | 99.97% |
| Domain Specific | SQL queries, Python code, Biology | 99.98% |
| Edge Cases | Single char, emojis, repetitions | 99.97% |
| Semantic Variations | Paraphrases | 99.99% |
Overall Statistics:
- Average similarity: 99.98%
- Minimum similarity: 99.95%
- Maximum similarity: 99.99%
- Standard deviation: 0.01%
- All test cases maintain > 99.95% similarity
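These numbers can be spot-checked by encoding the same texts with both models and comparing the cosine similarity of their dense embeddings. The snippet below is a minimal sketch of such a check; the test sentences are illustrative and not the evaluation set used for the table above.

```python
import numpy as np
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

model_id = "idomeneo/bge-m3-onnx-graviton4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
fp32_model = ORTModelForCustomTasks.from_pretrained(model_id, file_name="model_optimized.onnx")
int8_model = ORTModelForCustomTasks.from_pretrained(
    model_id, subfolder="quantized", file_name="model_optimized_quantized.onnx"
)

texts = ["Neural networks learn hierarchical representations.", "机器学习正在改变世界。"]
inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)

fp32 = fp32_model.forward(**inputs)["dense_vecs"]
int8 = int8_model.forward(**inputs)["dense_vecs"]

# Per-example cosine similarity between FP32 and INT8 dense embeddings.
cos = (fp32 * int8).sum(axis=1) / (np.linalg.norm(fp32, axis=1) * np.linalg.norm(int8, axis=1))
print(cos)
```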
## Performance
On AWS Graviton4 instances, this optimized model provides:
- Up to 3x faster inference compared to PyTorch
- Reduced memory footprint
- Native bfloat16 acceleration support
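To measure the speedup on your own instance, you can time the ONNX model against a PyTorch baseline. The sketch below compares against the plain `BAAI/bge-m3` encoder loaded with `transformers` (which lacks the sparse/ColBERT heads, so it is only a rough baseline); batch size and iteration counts are arbitrary.

```python
import time

import torch
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoModel, AutoTokenizer

model_id = "idomeneo/bge-m3-onnx-graviton4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_model = ORTModelForCustomTasks.from_pretrained(model_id, file_name="model_optimized.onnx")
torch_model = AutoModel.from_pretrained("BAAI/bge-m3").eval()

texts = ["Benchmarking BGE-M3 on AWS Graviton4."] * 16
np_inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
pt_inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

def bench(fn, runs=10):
    fn()  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

onnx_t = bench(lambda: onnx_model.forward(**np_inputs))
with torch.inference_mode():
    torch_t = bench(lambda: torch_model(**pt_inputs))
print(f"ONNX: {onnx_t * 1000:.1f} ms/batch, PyTorch: {torch_t * 1000:.1f} ms/batch")
```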
## Model Files

### Standard Model
- `model_optimized.onnx`: O3-optimized ONNX model with GELU approximation
- `model_optimized.onnx.data`: External weights file
- `config.json`: Model configuration
- `tokenizer.json`: Fast tokenizer
- `tokenizer_config.json`: Tokenizer configuration
- `sentencepiece.bpe.model`: SentencePiece model
- `special_tokens_map.json`: Special tokens mapping
- `ort_config.json`: ONNX Runtime configuration
### Quantized Model (in `quantized/` subdirectory)

- `model_optimized_quantized.onnx`: INT8 quantized model
- `model_optimized_quantized.onnx.data`: Quantized weights
- All tokenizer files are shared with the standard model
## Citation

```bibtex
@article{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2402.03216},
  year={2024}
}
```
## License
MIT License (inherited from BAAI/bge-m3)