mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval
Abstract
We present systematic efforts in building a long-context multilingual text representation model (TRM) and a reranker from scratch for text retrieval. We first introduce a base-size text encoder enhanced with RoPE and unpadding, pre-trained with a native 8192-token context (far beyond the 512-token limit of previous multilingual encoders). We then construct a hybrid TRM and a cross-encoder reranker via contrastive learning. Evaluations show that our text encoder outperforms the previous state-of-the-art XLM-R of the same size. Meanwhile, our TRM and reranker match the performance of the large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrates that the proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness will benefit various research and industrial applications.
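The abstract names the key ingredients without code, so the two sketches below are illustrative reconstructions, not the authors' implementation. First, a minimal PyTorch version of rotary position embedding (RoPE), the mechanism that allows the encoder to handle long inputs natively; the function name and the (seq_len, head_dim) tensor layout are assumptions made for clarity.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each (even, odd) channel pair of `x` (seq_len, head_dim)
    by an angle proportional to the token position, so that attention
    scores depend only on the relative distance between tokens."""
    seq_len, dim = x.shape
    # Pair i rotates with frequency theta_i = base^(-2i/dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because the rotation encodes only relative positions, a RoPE-based encoder can be trained at, and extended to, longer contexts such as 8192 tokens. Second, a sketch of the standard in-batch InfoNCE objective commonly used for the kind of contrastive learning the abstract describes; the helper name and temperature value are illustrative assumptions, and the paper's actual recipe may differ (e.g., in how negatives are mined).

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(queries: torch.Tensor,
                              passages: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: row i of `passages` is the
    positive for row i of `queries`; every other row is a negative."""
    q = F.normalize(queries, dim=-1)
    p = F.normalize(passages, dim=-1)
    logits = q @ p.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```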
Community
New multilingual embedding and reranking models!
The following similar papers were recommended by the Semantic Scholar API:
- Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe (2024)
- Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models (2024)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment (2024)
- News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation (2024)
- MINERS: Multilingual Language Models as Semantic Retrievers (2024)
Models citing this paper: 15
Datasets citing this paper: 0