---
license: mit
language:
- en
---

# BGE-large-en-v1.5-rag-int8-static

An INT8 version of [BAAI/BGE-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5), quantized with [Intel® Neural Compressor](https://github.com/huggingface/optimum-intel) and compatible with [Optimum-Intel](https://github.com/huggingface/optimum-intel).

The model can be used with the [Optimum-Intel](https://github.com/huggingface/optimum-intel) API as a standalone model, or as an embedder or ranker module in a [fastRAG](https://github.com/IntelLabs/fastRAG) RAG pipeline.

## Technical details

Quantized using post-training static quantization.

|                   |                                                                               |
|-------------------|:-----------------------------------------------------------------------------:|
| Calibration set   | [qasper](https://huggingface.co/datasets/allenai/qasper) (100 random samples) |
| Quantization tool | [Optimum-Intel](https://github.com/huggingface/optimum-intel)                 |
| Backend           | `IPEX`                                                                        |
| Original model    | [BAAI/BGE-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)       |

Instructions for reproducing the quantized model can be found [here](https://github.com/IntelLabs/fastRAG/tree/main/scripts/optimizations/embedders).

## Evaluation - MTEB

Model performance on the [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) *retrieval* and *reranking* tasks.

| Task      | `INT8` | `FP32` | % diff  |
|-----------|:------:|:------:|:-------:|
| Reranking | 0.5997 | 0.6003 | -0.108% |
| Retrieval | 0.5346 | 0.5429 | -1.53%  |

## Usage

### Using with Optimum-Intel

See the [Optimum-Intel](https://github.com/huggingface/optimum-intel) installation page for instructions, or run:

```sh
pip install -U optimum[neural-compressor,ipex] intel-extension-for-transformers
```

Loading a model:

```python
from optimum.intel import IPEXModel

model = IPEXModel.from_pretrained("Intel/bge-large-en-v1.5-rag-int8-static")
```

Running inference:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-large-en-v1.5-rag-int8-static")

sentences = ["This is an example sentence."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)
    # take the vector of the [CLS] token as the sentence embedding
    embedded = outputs[0][:, 0]
```

### Using with a fastRAG RAG pipeline

Get started by installing [fastRAG](https://github.com/IntelLabs/fastRAG) as instructed [here](https://github.com/IntelLabs/fastRAG).

Below is an example of loading the model into a ranker node that embeds and re-ranks all the documents it receives as input in a pipeline:

```python
from fastrag.rankers import QuantizedBiEncoderRanker

ranker = QuantizedBiEncoderRanker("Intel/bge-large-en-v1.5-rag-int8-static")
```

and plugging it into a pipeline:

```python
from haystack import Pipeline

p = Pipeline()
p.add_node(component=retriever, name="retriever", inputs=["Query"])
p.add_node(component=ranker, name="ranker", inputs=["retriever"])
```

See a more complete example notebook [here](https://github.com/IntelLabs/fastRAG/blob/main/examples/optimized-embeddings.ipynb).
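
As a follow-up to the standalone inference example above, the snippet below is a minimal sketch of turning the `[CLS]` vectors into normalized embeddings and scoring a query against passages with cosine similarity. It is not part of the original card: the texts are placeholders, and the L2 normalization follows the common usage pattern for BGE-style embedders.

```python
import torch
from transformers import AutoTokenizer
from optimum.intel import IPEXModel

model_id = "Intel/bge-large-en-v1.5-rag-int8-static"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = IPEXModel.from_pretrained(model_id)

# Placeholder query and passages (illustrative only)
texts = [
    "What is post-training static quantization?",
    "Post-training static quantization calibrates activation ranges on a small dataset.",
    "The weather in Paris is mild in spring.",
]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] vectors, L2-normalized so the dot product equals cosine similarity
embeddings = torch.nn.functional.normalize(outputs[0][:, 0], p=2, dim=1)

# Similarity of the query (first text) to each passage; higher means more relevant
scores = embeddings[0] @ embeddings[1:].T
print(scores)
```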