Language distribution in MTEB

#83
by maiia-bocharova - opened

Hello, I am writing a paper for my PhD (on text embeddings for the Ukrainian language) and I want to include information about the language distribution in MTEB (for example, tokens per language). How can I get such statistics? I did not find anything apart from the number of languages in the official MTEB paper.
Can you please help?

Massive Text Embedding Benchmark org

Hi @Maiia, I believe Ukrainian is only included in the bitext mining tasks. You can easily search the GitHub repo and see it here:

https://github.com/search?q=repo%3Aembeddings-benchmark%2Fmteb+%22uk%22&type=code
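
If it helps, once the language codes for each task have been collected (for example, by hand from the search results above), a per-language task count can be tallied with a few lines of Python. This is only a rough sketch: the task names and language lists below are made-up placeholders, not real MTEB metadata, and `count_tasks_per_language` is a hypothetical helper, not part of the mteb package. It counts tasks rather than tokens; token counts would require loading and tokenizing each dataset.

```python
from collections import Counter


def count_tasks_per_language(task_languages):
    """Count how many tasks list each language code.

    task_languages: dict mapping a task name to a list of language codes.
    """
    counts = Counter()
    for languages in task_languages.values():
        # Deduplicate so a task that lists a code twice is counted once.
        counts.update(set(languages))
    return counts


# Hypothetical placeholder data, NOT real MTEB metadata.
example = {
    "ExampleBitextTask": ["en", "uk", "de"],
    "ExampleClassificationTask": ["en", "de"],
    "ExampleRetrievalTask": ["en"],
}

counts = count_tasks_per_language(example)
for lang, n in counts.most_common():
    print(lang, n)
```

Replacing `example` with the real task-to-language mapping would give the number of tasks that cover each language, which could serve as a first approximation of the distribution.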
