---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: mit
---

# BAAI-Multilingual-Base

**BAAI-Multilingual-Base** is a text embedding model distinguished by its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It supports more than 100 working languages.
- Multi-Granularity: It can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.

## Usage

Install:
```
pip install -U FlagEmbedding
```

### Generate Embedding for text

- Dense Embedding
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1,
                            batch_size=12,
                            max_length=8192,  # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.7026 0.439 ]
#  [0.361  0.678 ]]
```
You can also use sentence-transformers and Hugging Face transformers to generate dense embeddings.

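The `embeddings_1 @ embeddings_2.T` step in the dense example builds a matrix of pairwise inner products, one row per query and one column per passage. The sketch below illustrates that computation in plain Python on toy 2-dimensional vectors. It assumes the vectors are L2-normalized (an assumption made here, not something stated above), in which case each inner product equals the cosine similarity. The helper names `l2_normalize` and `similarity_matrix` are hypothetical and not part of FlagEmbedding.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(rows_a, rows_b):
    """Pairwise inner products: result[i][j] = dot(rows_a[i], rows_b[j])."""
    return [[sum(x * y for x, y in zip(a, b)) for b in rows_b] for a in rows_a]

# Toy stand-ins for dense embeddings; real embeddings have many more dimensions.
queries = [l2_normalize(v) for v in [[1.0, 0.0], [1.0, 1.0]]]
passages = [l2_normalize(v) for v in [[2.0, 0.0], [0.0, 3.0]]]

scores = similarity_matrix(queries, passages)
for row in scores:
    print([round(s, 4) for s in row])
```

Each row can be read the same way as the printed similarity matrix in the dense example: the largest entry in a row marks the passage closest to that query.
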
- Sparse Embedding (Lexical Weight)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.10126, 'is': 0.1063, 'BA': 0.1858, 'AI': 0.2576, '-': 0.05154, 'Mul': 0.1381, 'ti': 0.1404, 'lingu': 0.2734, 'al': 0.10095,
#   'Bas': 0.2299, 'e': 0.153, '?': 0.05536}, {'De': 0.05002, 'fin': 0.1368, 'ation': 0.04495, 'of': 0.0633, 'BM': 0.2517, '25': 0.3333}]

# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.3666038513183594

print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
```

- Multi-Vector (ColBERT)
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7982
# 0.4389
```

### Compute score for text pairs

Given a list of text pairs, you can get the scores computed by different methods.
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base', use_fp16=True)  # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i, j] for i in sentences_1 for j in sentences_2]

print(model.compute_score(sentence_pairs,
                          max_passage_length=128,  # a smaller max length leads to a lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4]))  # weights_for_different_modes(w) is used to do weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score

# {
#   'colbert': [0.7982305884361267, 0.438856840133667, 0.4464578628540039, 0.7897794842720032],
#   'sparse': [0.366455078125, 0.01297760009765625, 0.0, 0.1802978515625],
#   'dense': [0.70263671875, 0.43896484375, 0.361083984375, 0.67822265625],
#   'sparse+dense': [0.5905762314796448, 0.29696908593177795, 0.2407226711511612, 0.5122477412223816],
#   'colbert+sparse+dense': [0.6736379861831665, 0.3537241816520691, 0.3230167627334595, 0.6232604384422302]
# }
```
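
To make the weighted sum concrete, the following sketch recomputes the combined scores for the first text pair from the individual `dense`, `sparse`, and `colbert` values printed above. It does not call the model. Dividing by the sum of the weights in use is an assumption inferred from the printed numbers, which it reproduces; it is not confirmed here as the library's exact implementation. `weighted_average` is a hypothetical helper.

```python
# Individual scores for the first text pair, copied from the printed output.
dense_score = 0.70263671875
sparse_score = 0.366455078125
colbert_score = 0.7982305884361267

# Same order as weights_for_different_modes: dense, sparse, colbert.
weights = [0.4, 0.2, 0.4]

def weighted_average(scores, weights):
    """Weighted sum of scores, divided by the total weight used (assumed normalization)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# 'sparse+dense' uses only the first two weights.
sparse_dense = weighted_average([dense_score, sparse_score], weights[:2])
# 'colbert+sparse+dense' uses all three.
all_three = weighted_average([dense_score, sparse_score, colbert_score], weights)

print(round(sparse_dense, 4), round(all_three, 4))
# 0.5906 0.6736
```

Under this reading, raising `w[1]` shifts the combined score toward lexical overlap, while raising `w[0]` or `w[2]` favors the dense and multi-vector signals.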