
This is the pythera/mbert-retrieve-ctx-base model: it maps paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.

Usage

import torch
from transformers import AutoModel, AutoTokenizer

# CLS pooling: use the hidden state of the first ([CLS]) token as the sentence embedding
def cls_pooling(model_output):
    return model_output.last_hidden_state[:,0]

# Encode text
def encode(texts):
    # Tokenize sentences
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input, return_dict=True)

    # Perform pooling
    embeddings = cls_pooling(model_output)

    return embeddings

# Prepare the documents to embed
passages = [
# (Vietnamese) "2023 marked the year AI was no longer confined to a small community, but was applied widely to serve millions of Vietnamese, from writing prose to generating avatar images."
'2023 đánh dấu AI không còn bó hẹp trong cộng đồng nhỏ, mà ứng dụng rộng khắp để phục vụ hàng triệu người Việt, từ viết văn đến tạo ảnh avatar.',
'According to industry reports, the global machine learning market is expected to reach a staggering $96.7 billion by 2025.'
]

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModel.from_pretrained('pythera/mbert-retrieve-ctx-base')
tokenizer = AutoTokenizer.from_pretrained('pythera/mbert-retrieve-ctx-base')

# Encode docs
output_emb = encode(passages)
print('Output embedding: ', output_emb)
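
The 768-dimensional embeddings can then be scored against a query embedding for retrieval. The card does not specify the scoring function, so the dot-product scoring below is an assumption; likewise, this repository is the context (ctx) encoder of a bi-encoder, and the sketch reuses it to encode the query purely for illustration, where a paired query encoder would normally be used.

# Minimal semantic-search sketch (assumptions: dot-product scoring; the
# query is encoded with this context encoder for illustration only)
query_emb = encode(['When will the machine learning market reach $96.7 billion?'])

# Score the query against every passage embedding: [1, 768] @ [768, n] -> [1, n]
scores = query_emb @ output_emb.T

# Rank passages by score, highest first
for rank, idx in enumerate(torch.argsort(scores[0], descending=True).tolist(), start=1):
    print(f'{rank}. (score {scores[0, idx].item():.2f}) {passages[idx][:60]}...')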

Evaluation

We evaluate our model on mMARCO (vi) and compare it with several baselines:

| Model | Training datasets | Recall@1000 | MRR@10 |
|---|---|---|---|
| vietnamese-bi-encoder | MSMARCO + SQuADv2.0 + 80% Zalo | 79.58 | 18.74 |
| mColB | MSMARCO | 71.90 | 18.00 |
| mbert (ours) | MSMARCO | 85.86 | 21.42 |
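
MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage within the top-10 results, counting queries with no relevant hit in the top 10 as 0. A minimal worked example with hypothetical ranks (not the actual evaluation code):

# Hypothetical: 1-based rank of the first relevant passage per query,
# or None when nothing relevant appeared in the top 10
first_relevant_ranks = [1, 3, None, 2]

mrr_at_10 = sum(1.0 / r for r in first_relevant_ranks if r is not None) / len(first_relevant_ranks)
print(f'MRR@10 = {mrr_at_10:.4f}')  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.4583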