This is the pythera/mbert-retrieve-ctx-base model: it maps paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.
```python
import torch
from transformers import AutoModel, AutoTokenizer

# CLS pooling - take the output of the first token
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings

# Prepare the documents to embed
passage = [
    # Vietnamese: "2023 marked the point where AI was no longer confined to a small
    # community, but applied widely to serve millions of Vietnamese, from writing
    # text to creating avatar images."
    '2023 đánh dấu AI không còn bó hẹp trong cộng đồng nhỏ, mà ứng dụng rộng khắp để phục vụ hàng triệu người Việt, từ viết văn đến tạo ảnh avatar.',
    'According to industry reports, the global machine learning market is expected to reach a staggering $96.7 billion by 2025.'
]

# Load model from the Hugging Face Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-ctx-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-ctx-base')

# Encode the passages
output_emb = encode(passage)
print('Output embedding:', output_emb)
```
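Once passages are embedded, semantic search reduces to a nearest-neighbour lookup in the 768-dimensional space. Below is a minimal sketch that reuses the `encode` helper above to embed a query and score it against the passages with a dot product; the query string, and the choice to reuse this context encoder for the query (rather than a dedicated query encoder, if one is published), are assumptions for illustration only.

```python
# Minimal semantic-search sketch built on the `encode` helper above.
query = 'How large will the machine learning market be by 2025?'  # hypothetical query

query_emb = encode([query])        # shape: (1, 768)
passage_emb = encode(passage)      # shape: (2, 768)

# Dot-product similarity between the query and every passage
scores = query_emb @ passage_emb.T # shape: (1, 2)
best = scores.argmax(dim=1).item()

print('Scores:', scores)
print('Best passage:', passage[best])
```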
We evaluate our model on mMARCO (vi) against several baselines:
| Model | Trained Datasets | Recall@1000 | MRR@10 |
|---|---|---|---|
| vietnamese-bi-encoder | MS MARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColB | MS MARCO | 71.90 | 18.0 |
| mbert (ours) | MS MARCO | 85.86 | 21.42 |
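For reference, MRR@10 rewards the rank of the first relevant passage within the top 10, while Recall@1000 measures how much of the relevant set appears in the top 1000. The sketch below shows both metrics for a single query, assuming ranked document-ID lists and a relevance set; the IDs are made up for illustration.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant doc within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of relevant docs retrieved within the top k."""
    retrieved = set(ranked_ids[:k]) & set(relevant_ids)
    return len(retrieved) / len(relevant_ids)

# Toy example with made-up document IDs
ranked = ['d7', 'd3', 'd9', 'd1']
relevant = {'d3'}
print(mrr_at_k(ranked, relevant))     # 0.5 (first relevant doc at rank 2)
print(recall_at_k(ranked, relevant))  # 1.0
```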