BGE-M3 ONNX for AWS Graviton4

This is an ONNX export of BAAI/bge-m3, optimized specifically for AWS Graviton4 processors.

Model Description

BGE-M3 is a versatile embedding model from the FlagEmbedding project that supports:

  • Dense retrieval: Traditional sentence embeddings
  • Sparse retrieval: Lexical matching with learnable sparse weights
  • Multi-vector retrieval: ColBERT-style token-level embeddings
  • Multi-lingual support: Over 100 languages

Optimization Details

  • ONNX Opset: 17
  • Optimization Level: O3 (GELU approximation enabled for BGE-M3)
  • Target Hardware: AWS Graviton4 (ARM64 with bfloat16 support)
  • Model Size: ~2.2GB
  • Quantized Version Available: INT8 quantized model in quantized/ subdirectory

Usage

Quick Start

from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Load from Hugging Face Hub
model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    file_name="model_optimized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")

# Tokenize and get embeddings
inputs = tokenizer("Your text here", return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)

# Access different embedding types
dense_embeddings = outputs["dense_vecs"]    # Shape: (batch_size, 1024)
sparse_embeddings = outputs["sparse_vecs"]   # Shape: (batch_size, seq_len, 1)
colbert_embeddings = outputs["colbert_vecs"] # Shape: (batch_size, seq_len, 1024)
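
The sketch below shows one way these outputs could be combined for scoring: cosine similarity for the dense vectors, attention-masked token weights for the sparse output, and ColBERT-style max-sim for the multi-vector output. It is a minimal sketch that assumes the output names and shapes commented above, reuses the model and tokenizer from the Quick Start, and uses placeholder example texts.

import numpy as np

# Embed a query and a passage in one batch (placeholder texts)
texts = ["What is BGE-M3?", "BGE-M3 is a multilingual embedding model."]
inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)

# Dense retrieval: cosine similarity between L2-normalized sentence vectors
dense = outputs["dense_vecs"]
dense = dense / np.linalg.norm(dense, axis=-1, keepdims=True)
dense_score = float(dense[0] @ dense[1])

# Sparse retrieval: per-token weights, with padding positions zeroed out
sparse_weights = outputs["sparse_vecs"][..., 0] * inputs["attention_mask"]

# Multi-vector retrieval: ColBERT-style max-sim over non-padding token embeddings
colbert = outputs["colbert_vecs"]
colbert = colbert / np.linalg.norm(colbert, axis=-1, keepdims=True)
mask = inputs["attention_mask"].astype(bool)
sim = colbert[0][mask[0]] @ colbert[1][mask[1]].T   # query tokens x passage tokens
colbert_score = float(sim.max(axis=1).mean())

print(dense_score, colbert_score)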

AWS Graviton4 with bfloat16 Acceleration

import onnxruntime as ort
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Enable bfloat16 acceleration for Graviton4
sess_options = ort.SessionOptions()
sess_options.add_session_config_entry("mlas.enable_gemm_fastmath_arm64_bfloat16", "1")

model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    file_name="model_optimized.onnx",
    session_options=sess_options
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")
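
Whether the bfloat16 fastmath kernels help on a particular instance is easiest to check empirically. The timing loop below is a minimal sketch (batch size and iteration counts are arbitrary); running it once with and once without the session config entry above gives a direct comparison.

import time

# Small placeholder batch for timing
texts = ["Benchmarking BGE-M3 on Graviton4"] * 8
inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)

# Warm-up so one-time initialization is not measured
for _ in range(3):
    model.forward(**inputs)

# Timed runs
start = time.perf_counter()
for _ in range(20):
    model.forward(**inputs)
elapsed = time.perf_counter() - start
print(f"Average latency per batch: {elapsed / 20 * 1000:.1f} ms")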

Using the Quantized Model

The INT8 quantized model provides significant memory savings and faster inference with minimal quality loss:

from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Load quantized model
model = ORTModelForCustomTasks.from_pretrained(
    "idomeneo/bge-m3-onnx-graviton4",
    subfolder="quantized",
    file_name="model_optimized_quantized.onnx"
)
tokenizer = AutoTokenizer.from_pretrained("idomeneo/bge-m3-onnx-graviton4")

# Usage is identical to the standard model
inputs = tokenizer("Your text here", return_tensors="np", padding=True, truncation=True)
outputs = model.forward(**inputs)

Quantization Details

  • Quantization Type: Static INT8 quantization
  • Calibration Dataset: GLUE SST-2 (300 samples)
  • Per-channel: Enabled for better accuracy
  • Quality: 99.98% similarity with the original model across diverse test cases
  • Performance: ~2-4x faster inference on Graviton4 processors
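
For reference, a quantization run along these lines can be reproduced with Optimum's ORTQuantizer. The snippet below is a sketch of that recipe, not the exact script used to produce this checkpoint; the local paths and max_length are placeholders, and the external-data flags are an assumption based on the model's size.

from functools import partial

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
quantizer = ORTQuantizer.from_pretrained("./bge-m3-onnx", file_name="model_optimized.onnx")

# Static INT8, per-channel, targeting ARM64 kernels
qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=True)

# Calibration on 300 GLUE SST-2 samples, as described above
def preprocess(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", truncation=True, max_length=128)

calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=AutoCalibrationConfig.minmax(calibration_dataset),
    use_external_data_format=True,
)
quantizer.quantize(
    save_dir="./quantized",
    quantization_config=qconfig,
    calibration_tensors_range=ranges,
    use_external_data_format=True,
)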

Quantization Quality Results

Comprehensive testing across 20 diverse examples shows excellent quality retention:

Test Category        | Examples                                   | Avg Similarity
---------------------|--------------------------------------------|---------------
English Technical    | Machine learning, neural networks, NLP     | 99.98%
English General      | Common phrases, news topics                | 99.97%
Multilingual         | Chinese, Spanish, French, German, Japanese | 99.97%
Domain Specific      | SQL queries, Python code, Biology          | 99.98%
Edge Cases           | Single char, emojis, repetitions           | 99.97%
Semantic Variations  | Paraphrases                                | 99.99%

Overall Statistics:

  • Average similarity: 99.98%
  • Minimum similarity: 99.95%
  • Maximum similarity: 99.99%
  • Standard deviation: 0.01%
  • All test cases maintain > 99.95% similarity
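
Numbers of this kind can be sanity-checked by embedding the same texts with both models and comparing the dense vectors. The snippet below is a minimal sketch of that comparison; it assumes the full-precision model is loaded as model and the INT8 model as quantized_model (the same loading call shown earlier, bound to a different name), and the test strings are placeholders.

import numpy as np

def dense_embed(m, texts):
    enc = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
    vecs = m.forward(**enc)["dense_vecs"]
    return vecs / np.linalg.norm(vecs, axis=-1, keepdims=True)

texts = ["Neural networks learn distributed representations.", "机器学习模型可以嵌入多种语言。"]
standard_vecs = dense_embed(model, texts)
quantized_vecs = dense_embed(quantized_model, texts)

# Cosine similarity between corresponding vectors from the two models
similarities = (standard_vecs * quantized_vecs).sum(axis=-1)
print(similarities)  # values close to 1.0 indicate minimal quality loss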

Performance

On AWS Graviton4 instances, this optimized model provides:

  • Up to 3x faster inference compared to PyTorch
  • Reduced memory footprint
  • Native bfloat16 acceleration support

Model Files

Standard Model

  • model_optimized.onnx: O3-optimized ONNX model with GELU approximation
  • model_optimized.onnx.data: External weights file
  • config.json: Model configuration
  • tokenizer.json: Fast tokenizer
  • tokenizer_config.json: Tokenizer configuration
  • sentencepiece.bpe.model: SentencePiece model
  • special_tokens_map.json: Special tokens mapping
  • ort_config.json: ONNX Runtime configuration

Quantized Model (in quantized/ subdirectory)

  • model_optimized_quantized.onnx: INT8 quantized model
  • model_optimized_quantized.onnx.data: Quantized weights
  • All tokenizer files are shared with the standard model

Citation

@article{bge-m3,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2402.03216},
  year={2024}
}

License

MIT License (inherited from BAAI/bge-m3)
