This is pythera/mbert-retrieve-qry-base: it maps paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# CLS Pooling - take the output of the first ([CLS]) token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)
    # Perform pooling
    embeddings = cls_pooling(model_output)
    return embeddings

# Queries to embed ("Why is the sky blue?", "What is the definition of Generative AI?")
query = [
    'Tại sao bầu trời lại màu xanh?',
    'Định nghĩa Generative AI là gì?'
]

# Load model from the Hugging Face Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-qry-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-qry-base')

# Encode queries
output_emb = encode(query)
print('Output embedding: ', output_emb)
```
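Because the embeddings live in a shared dense space, retrieval reduces to scoring query vectors against passage vectors. The sketch below is a minimal illustration of dot-product ranking; the `passages` strings are hypothetical, and in a real bi-encoder setup the passages would typically be embedded with a paired passage/context encoder rather than this query encoder (we reuse `encode()` here purely for illustration).

```python
# Hypothetical passages for illustration only.
passages = [
    'Bầu trời có màu xanh do tán xạ Rayleigh.',            # "The sky is blue due to Rayleigh scattering."
    'Generative AI là các mô hình tạo ra nội dung mới.',   # "Generative AI refers to models that create new content."
]
# Assumption: in practice, embed passages with the matching passage encoder.
passage_emb = encode(passages)                 # shape: (num_passages, 768)

# Score every query against every passage with a dot product, then rank.
scores = output_emb @ passage_emb.T            # shape: (num_queries, num_passages)
best = torch.argmax(scores, dim=1)             # index of the top passage per query
for q, idx in zip(query, best.tolist()):
    print(q, '->', passages[idx])
```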
We evaluate our model on mMARCO (vi) against several baselines:
| Model | Training Data | Recall@1000 | MRR@10 |
|---|---|---|---|
| vietnamese-bi-encoder | MSMARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColB | MSMARCO | 71.90 | 18.00 |
| mbert (ours) | MSMARCO | 85.86 | 21.42 |
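For reference, MRR@10 averages the reciprocal rank of the first relevant passage within each query's top 10 results (contributing 0 if none appears). A minimal sketch of the computation, with hypothetical `ranked_ids` and `relevant_ids` inputs (this is not the evaluation script used above):

```python
def mrr_at_10(ranked_ids, relevant_ids):
    """Mean Reciprocal Rank @ 10.

    ranked_ids: one ranked list of passage ids per query.
    relevant_ids: one set of relevant passage ids per query.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids)

# Toy example: first query hits at rank 2, second misses the top 10.
print(mrr_at_10([[7, 3, 9], [5, 6]], [{3}, {42}]))  # (1/2 + 0) / 2 = 0.25
```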