	---
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	license: mit
	---

	# BAAI-Multilingual-Base

	BAAI-Multilingual-Base is a text embedding model distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

	- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
	- Multi-Linguality: It supports more than 100 working languages.
	- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.


	## Usage

	Install:
	```
	pip install -U FlagEmbedding
	```

	### Generate Embedding for text

	- Dense Embedding
	```python
	from FlagEmbedding import BGEM3FlagModel

	model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
	                       use_fp16=True)  # use_fp16=True speeds up computation at the cost of a slight drop in accuracy

	sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
	sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
	               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

	embeddings_1 = model.encode(sentences_1,
	                            batch_size=12,
	                            max_length=8192,  # If your inputs are shorter, set a smaller value to speed up encoding.
	                            )['dense_vecs']
	embeddings_2 = model.encode(sentences_2)['dense_vecs']
	similarity = embeddings_1 @ embeddings_2.T
	print(similarity)
	# [[0.7026 0.439 ]
	#  [0.361  0.678 ]]
	```
	You can also use sentence-transformers or Hugging Face transformers to generate dense embeddings.
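Entry `[i][j]` of the similarity matrix in the FlagEmbedding example scores sentence `i` of the first list against sentence `j` of the second. The sketch below illustrates that matrix product with made-up 3-dimensional vectors rather than real model outputs. It assumes the dense vectors are L2-normalized, so that the dot product equals cosine similarity; that is an assumption, not something stated above. The helper `l2_normalize` is a hypothetical name introduced here.

```python
import numpy as np

# Toy stand-ins for model outputs; real embeddings have many more dimensions.
queries = np.array([[1.0, 0.0, 0.0],
                    [0.0, 3.0, 4.0]])
passages = np.array([[2.0, 0.0, 0.0],
                     [0.0, 0.0, 5.0]])


def l2_normalize(x):
    """Scale each row to unit length (hypothetical helper for this sketch)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


queries = l2_normalize(queries)
passages = l2_normalize(passages)

# similarity[i][j] is the cosine similarity of query i and passage j.
similarity = queries @ passages.T
print(similarity)
```

With unit-length rows, each score lies in [-1, 1], and a higher value means the two texts point in a more similar direction in embedding space.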


	- Sparse Embedding (Lexical Weight)
	```python
	from FlagEmbedding import BGEM3FlagModel

	model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
	                       use_fp16=True)  # use_fp16=True speeds up computation at the cost of a slight drop in accuracy

	sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
	sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
	               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

	output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
	output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

	# you can see the weight for each token:
	print(model.convert_id_to_token(output_1['lexical_weights']))
	# [{'What': 0.10126, 'is': 0.1063, 'BA': 0.1858, 'AI': 0.2576, '-': 0.05154, 'Mul': 0.1381, 'ti': 0.1404, 'lingu': 0.2734, 'al': 0.10095,
	# 'Bas': 0.2299, 'e': 0.153, '?': 0.05536}, {'De': 0.05002, 'fin': 0.1368, 'ation': 0.04495, 'of': 0.0633, 'BM': 0.2517, '25': 0.3333}]


	# compute the score via lexical matching
	lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
	print(lexical_scores)
	# 0.3666038513183594

	print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
	# 0.0
	```
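The second score above is 0.0 because the two queries share no tokens. A minimal sketch of that idea follows, assuming the lexical matching score is the sum, over tokens present in both texts, of the product of their weights. The function `lexical_matching_score` and the token weights are hypothetical and chosen only for illustration; this is not presented as the library's implementation.

```python
def lexical_matching_score(weights_1, weights_2):
    """Sum the products of weights for tokens present in both mappings."""
    score = 0.0
    for token, weight in weights_1.items():
        if token in weights_2:
            score += weight * weights_2[token]
    return score


# Hypothetical token weights, not real model outputs.
query = {"BM": 0.25, "25": 0.5, "of": 0.1}
passage = {"BM": 0.4, "25": 0.2, "ranks": 0.3}
unrelated = {"embedding": 0.3, "model": 0.2}

overlap_score = lexical_matching_score(query, passage)      # only "BM" and "25" contribute
no_overlap_score = lexical_matching_score(query, unrelated)  # no shared tokens
print(overlap_score, no_overlap_score)
```

Under this assumption, tokens that appear in only one of the two texts never contribute, so the score behaves like a learned, weighted form of exact term matching.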

	- Multi-Vector (ColBERT)
	```python
	from FlagEmbedding import BGEM3FlagModel

	model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
	                       use_fp16=True)  # use_fp16=True speeds up computation at the cost of a slight drop in accuracy

	sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
	sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
	               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

	output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
	output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

	print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
	print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
	# 0.7982
	# 0.4389
	```
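Multi-vector scoring compares one vector per token instead of one vector per text. The sketch below shows the usual ColBERT-style late interaction: each query token takes its best dot product against any passage token, and those maxima are averaged over the query tokens. Whether `colbert_score` aggregates in exactly this way is an assumption; `late_interaction_score` and the toy vectors are hypothetical.

```python
import numpy as np


def late_interaction_score(query_vecs, passage_vecs):
    """Average, over query tokens, of the best dot product with any passage token."""
    # token_scores has shape (n_query_tokens, n_passage_tokens).
    token_scores = query_vecs @ passage_vecs.T
    best_per_query_token = token_scores.max(axis=1)
    return float(best_per_query_token.mean())


# Toy unit-length token vectors: 2 query tokens, 3 passage tokens.
query_vecs = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
passage_vecs = np.array([[1.0, 0.0],
                         [0.6, 0.8],
                         [-1.0, 0.0]])

score = late_interaction_score(query_vecs, passage_vecs)
print(score)
```

Because every query token picks its own best match, the score can reward a passage that covers each part of the query in a different place, which a single dense vector cannot express directly.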


	### Compute score for text pairs
	Given a list of text pairs, you can get the scores computed by each method, along with their weighted combinations.
	```python
	from FlagEmbedding import BGEM3FlagModel

	model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
	                       use_fp16=True)  # use_fp16=True speeds up computation at the cost of a slight drop in accuracy

	sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
	sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
	               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

	sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]

	print(model.compute_score(
	    sentence_pairs,
	    max_passage_length=128,  # a smaller max length leads to lower latency
	    # weights_for_different_modes (w) defines the weighted sum:
	    # w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
	    weights_for_different_modes=[0.4, 0.2, 0.4],
	))

	# {
	#   'colbert': [0.7982305884361267, 0.438856840133667, 0.4464578628540039, 0.7897794842720032],
	#   'sparse': [0.366455078125, 0.01297760009765625, 0.0, 0.1802978515625],
	#   'dense': [0.70263671875, 0.43896484375, 0.361083984375, 0.67822265625],
	#   'sparse+dense': [0.5905762314796448, 0.29696908593177795, 0.2407226711511612, 0.5122477412223816],
	#   'colbert+sparse+dense': [0.6736379861831665, 0.3537241816520691, 0.3230167627334595, 0.6232604384422302]
	# }
	```
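The combined entries in the output above can be reproduced by hand from the per-mode scores. The sketch below recomputes them for the first pair, assuming the combination is a weighted sum divided by the sum of the weights used. That normalization is inferred from the printed numbers rather than confirmed, and `combine_scores` is a hypothetical helper.

```python
def combine_scores(scores, weights):
    """Weighted average of per-mode scores; weights follow the same order as scores."""
    total = sum(w * s for w, s in zip(weights, scores))
    return total / sum(weights)


# Per-mode scores of the first pair, copied from the output above.
dense, sparse, colbert = 0.70263671875, 0.366455078125, 0.7982305884361267
w = [0.4, 0.2, 0.4]  # order: dense, sparse, colbert

sparse_dense = combine_scores([dense, sparse], w[:2])
all_modes = combine_scores([dense, sparse, colbert], w)
print(round(sparse_dense, 4), round(all_modes, 4))
# 0.5906 0.6736
```

Both values agree with the `'sparse+dense'` and `'colbert+sparse+dense'` entries for the first pair up to floating-point rounding, which supports the normalization assumption for this example.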