---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
---

# BAAI-Multilingual-Base

**BAAI-Multilingual-Base** is a text embedding model notable for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It supports more than 100 working languages.
- Multi-Granularity: It can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

## Usage

Install:
```
pip install -U FlagEmbedding
```

### Generate Embedding for text

- Dense Embedding
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1,
                            batch_size=12,
                            max_length=8192,  # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']

# Rows correspond to sentences_1, columns to sentences_2
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.7026 0.439 ]
#  [0.361  0.678 ]]
```
You can also use sentence-transformers and Hugging Face transformers to generate dense embeddings.

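The `embeddings_1 @ embeddings_2.T` line in the dense example is a plain matrix of inner products. It equals cosine similarity only if the embeddings are L2-normalized, which is assumed here (worth checking for your FlagEmbedding version). A minimal pure-Python sketch with made-up 2-dimensional vectors; `normalize` and `similarity_matrix` are hypothetical helper names, not part of the library:

```python
import math

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(queries, passages):
    """Inner product of every query with every passage (rows: queries)."""
    return [[sum(q * p for q, p in zip(query, passage)) for passage in passages]
            for query in queries]

# Toy vectors standing in for real embeddings (illustrative values only).
queries = [normalize(v) for v in ([3.0, 4.0], [1.0, 0.0])]
passages = [normalize(v) for v in ([3.0, 4.0], [0.0, 2.0])]

scores = similarity_matrix(queries, passages)
for row in scores:
    print([round(s, 4) for s in row])
# [1.0, 0.8]
# [0.6, 0.0]
```

With unit-length vectors, identical directions score 1.0 and orthogonal ones score 0.0, which is why the matrix product can be read directly as a similarity.
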
- Sparse Embedding (Lexical Weight)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# You can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.10126, 'is': 0.1063, 'BA': 0.1858, 'AI': 0.2576, '-': 0.05154, 'Mul': 0.1381, 'ti': 0.1404, 'lingu': 0.2734, 'al': 0.10095,
#   'Bas': 0.2299, 'e': 0.153, '?': 0.05536}, {'De': 0.05002, 'fin': 0.1368, 'ation': 0.04495, 'of': 0.0633, 'BM': 0.2517, '25': 0.3333}]

# Compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.3666038513183594

# The two queries share no tokens, so their lexical score is zero
print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
```

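As a rough illustration of why the score between the two queries is 0.0: the lexical matching score is assumed here to be the sum, over tokens that occur in both texts, of the product of their weights. The helper below is a hypothetical re-implementation under that assumption, not the library's code. It reuses the query token weights printed above (keyed by token string for readability), while `toy_passage` is a made-up dictionary:

```python
def lexical_matching_score(weights_a, weights_b):
    """Sum w_a[token] * w_b[token] over tokens present in both dicts (assumed formula)."""
    return sum((w * weights_b[tok] for tok, w in weights_a.items() if tok in weights_b), 0.0)

# Token weights copied from the printed output above.
query_1 = {'What': 0.10126, 'is': 0.1063, 'BA': 0.1858, 'AI': 0.2576, '-': 0.05154,
           'Mul': 0.1381, 'ti': 0.1404, 'lingu': 0.2734, 'al': 0.10095,
           'Bas': 0.2299, 'e': 0.153, '?': 0.05536}
query_2 = {'De': 0.05002, 'fin': 0.1368, 'ation': 0.04495, 'of': 0.0633, 'BM': 0.2517, '25': 0.3333}

# No token occurs in both queries, so nothing contributes to the sum.
no_overlap = lexical_matching_score(query_1, query_2)
print(no_overlap)  # 0.0

# Hypothetical passage weights (not model output) sharing 'BM' and '25' with query_2.
toy_passage = {'BM': 0.2, '25': 0.3, 'is': 0.1}
overlap = lexical_matching_score(query_2, toy_passage)
print(overlap)
```

Only exact token overlap contributes, which is what makes this mode behave like term matching rather than semantic similarity.
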
- Multi-Vector (ColBERT)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

# Score the first query against each passage
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7982
# 0.4389
```

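The multi-vector score is a late-interaction (MaxSim) score in the ColBERT style. A minimal sketch, assuming the score is the mean over query tokens of each token's largest inner product with any passage token; `maxsim_score` is a hypothetical helper and the 2-dimensional token vectors are made up for illustration:

```python
def maxsim_score(query_vecs, passage_vecs):
    """Mean over query tokens of the best inner product with any passage token."""
    best_per_query_token = [
        max(sum(q * p for q, p in zip(q_vec, p_vec)) for p_vec in passage_vecs)
        for q_vec in query_vecs
    ]
    return sum(best_per_query_token) / len(best_per_query_token)

# Two query tokens and two passage tokens, each a unit-length toy vector.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
passage_vecs = [[1.0, 0.0], [0.6, 0.8]]

score = maxsim_score(query_vecs, passage_vecs)
print(round(score, 4))  # 0.9
```

The first query token finds an exact match (1.0) and the second finds its best partial match (0.8), so the average is 0.9. Each query token is matched independently, which is what distinguishes this mode from a single dense vector per text.
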
### Compute score for text pairs
Given a list of text pairs, you can get the scores computed by the different methods.
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

# Every query paired with every passage: 2 x 2 = 4 pairs
sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

print(model.compute_score(sentence_pairs,
                          max_passage_length=128,  # a smaller max length leads to a lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4]))  # weights_for_different_modes (w) is used for the weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score

# {
#   'colbert': [0.7982305884361267, 0.438856840133667, 0.4464578628540039, 0.7897794842720032],
#   'sparse': [0.366455078125, 0.01297760009765625, 0.0, 0.1802978515625],
#   'dense': [0.70263671875, 0.43896484375, 0.361083984375, 0.67822265625],
#   'sparse+dense': [0.5905762314796448, 0.29696908593177795, 0.2407226711511612, 0.5122477412223816],
#   'colbert+sparse+dense': [0.6736379861831665, 0.3537241816520691, 0.3230167627334595, 0.6232604384422302]
# }
```

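The combined scores in the output can be reproduced by hand. The sketch below assumes the weighted sum is divided by the sum of the weights actually used. That assumption is not stated in the text but is consistent with the printed numbers, and it matters for `sparse+dense`, where only the first two weights apply. `fuse` is a hypothetical helper, not a FlagEmbedding function:

```python
def fuse(scores, weights):
    """Weighted average of per-mode scores, normalized by the weights used (assumed)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Scores of the first text pair, copied from the output above.
dense, sparse, colbert = 0.70263671875, 0.366455078125, 0.7982305884361267
w = [0.4, 0.2, 0.4]  # same order as weights_for_different_modes: dense, sparse, colbert

sparse_dense = fuse([dense, sparse], w[:2])
all_modes = fuse([dense, sparse, colbert], w)
print(round(sparse_dense, 4), round(all_modes, 4))  # 0.5906 0.6736
```

Both values agree with the first entries of `'sparse+dense'` and `'colbert+sparse+dense'` to within floating-point precision.
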