Cross-Encoder for multilingual MS Marco

This model was trained on the MMARCO dataset, a machine-translated version of MS MARCO produced with Google Translate and covering 14 languages. In our experiments, we observed that it also performs well for other languages.

As a base model, we used the multilingual MiniLMv2 model.

The model can be used for information retrieval: given a query, encode the query with all possible passages (e.g. retrieved with ElasticSearch), then sort the passages by score in decreasing order. See SBERT.net Retrieve & Re-rank for more details. The training code is available here: SBERT.net Training MS Marco
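The re-ranking step described above can be sketched as follows. This is a minimal illustration, not the SBERT.net implementation: `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the snippet runs without downloading a model. In practice, `score_fn` would wrap the cross-encoder, for example `lambda pairs: model.predict(pairs)`.

```python
import re
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[List[Pair]], Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return passages sorted by score, highest first."""
    pairs = [(query, passage) for passage in passages]
    scores = score_fn(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


def toy_overlap_score(pairs: List[Pair]) -> List[int]:
    """Stand-in scorer (an assumption, NOT the cross-encoder): counts shared lowercase words."""
    def words(text: str) -> set:
        return set(re.findall(r"\w+", text.lower()))
    return [len(words(q) & words(p)) for q, p in pairs]


candidates = [
    "New York City is famous for the Metropolitan Museum of Art.",
    "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
]
results = rerank("How many people live in Berlin?", candidates, toy_overlap_score)
for passage, score in results:
    print(f"{score:.1f}\t{passage}")
```

Passing the scorer in as a callable keeps the sorting logic independent of the model, so the same function works with any scorer that returns one number per pair.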

Usage with SentenceTransformers

Usage is easiest with SentenceTransformers installed. You can then use the pre-trained models like this:

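from sentence_transformers import CrossEncoder
# 'model_name' is a placeholder for the model identifier
model = CrossEncoder('model_name')
# Each tuple is a (query, passage) pair; predict returns one score per pair
scores = model.predict([('Query', 'Paragraph1'), ('Query', 'Paragraph2'), ('Query', 'Paragraph3')])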

Usage with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

features = tokenizer(
    ['How many people live in Berlin?', 'How many people live in Berlin?'],
    ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],
    padding=True, truncation=True, return_tensors="pt",
)

model.eval()
with torch.no_grad():
    scores = model(**features).logits
    print(scores)
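
The printed logits are raw, unbounded relevance scores, one row per (query, passage) pair. The sketch below shows one way to turn them into a ranking, assuming the model emits a single logit per pair, so that `scores.tolist()` yields a list of one-element lists. `rank_from_logits` is a hypothetical helper, and the example values are made up for illustration rather than taken from real model output. The sigmoid is only a convenience for reading scores on a 0 to 1 scale; because it is monotonic, it does not change the order.

```python
import math
from typing import List, Tuple


def rank_from_logits(logits: List[List[float]]) -> List[Tuple[int, float, float]]:
    """Return (pair_index, raw_logit, sigmoid) tuples, most relevant first.

    Assumes one logit per pair, i.e. the nested-list form of an (N, 1) tensor.
    """
    flat = [row[0] for row in logits]
    order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
    return [(i, flat[i], 1.0 / (1.0 + math.exp(-flat[i]))) for i in order]


# Made-up values standing in for `scores.tolist()`; not real model output.
example_logits = [[-3.2], [5.1]]
ranking = rank_from_logits(example_logits)
for idx, logit, prob in ranking:
    print(f"pair {idx}: logit={logit:+.2f} sigmoid={prob:.3f}")
```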