Training data annotations for MTEB
Dear Alibaba NLP team,
I am Márton, a maintainer of the Massive Text Embedding Benchmark (MTEB). As part of our recent efforts to extend MTEB into a fully multilingual benchmark, we have been collecting information on which tasks in our benchmark can be considered in-domain or out-of-domain for different models, i.e. tracking which models were fine-tuned on which benchmark tasks. This is useful to our users: they can not only see how well models perform on the tasks, but also judge how reliably that performance reflects the models' generalization ability. I am writing to you because we still lack training data annotations for your models, some of which are already present on our leaderboard.
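To make the request concrete, below is a minimal sketch of the kind of per-model mapping we are assembling. The dataset names and splits are purely illustrative placeholders, not claims about what your models were actually trained on:

```python
# Illustrative sketch of the annotation we collect per model:
# a mapping from MTEB task/dataset names to the splits used during training.
# The entries below are placeholders, not statements about the gte models.
training_data_annotation = {
    "Alibaba-NLP/gte-Qwen2-7B-instruct": {
        "MSMARCO": ["train"],  # hypothetical example entry
        "NQ": ["train"],       # hypothetical example entry
    },
}
```

Even a plain list of MTEB task names used for fine-tuning each model would be enough for us to add the annotations.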
We have read a number of your technical reports, but it was either unclear to us, or we may have missed it, whether (and which) datasets from the Chinese/English/Multilingual MTEB were used to train the following models:
org: Alibaba-NLP
- gte-Qwen2-7B-instruct
- gte-Qwen1.5-7B-instruct
- gte-Qwen2-1.5B-instruct
- gte-base-en-v1.5
org: thenlper
- gte-small-zh
- gte-base-zh
- gte-large-zh
- gte-base
- gte-large
- gte-small
As it currently stands, we display a warning to our users indicating that we have no information about the training data of these models.
The team and our users would greatly appreciate it if you could share this information with us.
Thanks in advance for your help,
Márton