Gameselo/STS-multilingual-mpnet-base-v2

Feb 26

Hi @Gameselo
I'm Márton, maintainer of MTEB. We have been collecting metadata on models so that we can give our users a realistic estimate of how well a model's MTEB scores reflect its general performance (models that train on MTEB data obviously score better).
We are still lacking annotations for your model, and I couldn't find information about what it was trained on.
Could you please tell us which datasets in MTEB, in particular in the multilingual benchmark, were or were not used to train this model?
Thanks in advance, Márton

Gameselo

Owner Mar 1

Hi Márton, hope you're fine.
Sorry for my late answer, but here it is.
My model was trained on my whole dataset (https://huggingface.co/datasets/Gameselo/monolingual-wideNLI); you'll find all the specs on its page. I used all the data that wasn't in MTEB evaluation datasets.
I also used the MTEB training splits to train my model, but not the other splits (I used the dev split for validation and the test split for evaluation).
Hope this helps.
Léo

kardosdrur

Mar 2

Hi again, no worries, and thanks for getting back to me.
When you say you have trained on the MTEB training sets, do you mean all of them or just the ones that are in your dataset (monolingual-wideNLI)?

Gameselo

Owner Mar 3

All of them!

kardosdrur

Mar 3

Thanks! I'm adding the annotations!

Gameselo

Owner Mar 3

For your convenience: I used the train splits of the datasets in the May-June 2024 version of the MTEB leaderboard. Datasets added to the leaderboard since then weren't used at all.


Training data