hivaze/ru-e5-large

About model creation

This is a smaller version of intfloat/multilingual-e5-large in which only the Russian (more generally, Cyrillic) tokens and a reduced set of English tokens, together with their embeddings, have been kept.

The model was created in a way similar to the one described in this post: https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fhow-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90

The CulturaX dataset was used to find the required tokens. As a result, of the roughly 250k tokens in the original model's vocabulary, only the 69,382 required ones were kept.
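
The token search can be sketched roughly as follows. This is a minimal illustration and not the author's actual script: the toy vocabulary `VOCAB`, the whitespace `tokenize` helper, the `min_count` threshold, and the three-sentence corpus are all hypothetical stand-ins for the real tokenizer and the CulturaX texts.

```python
from collections import Counter

# Hypothetical toy vocabulary standing in for the original ~250k-token one.
VOCAB = {"<s>": 0, "</s>": 1, "<unk>": 2, "привет": 3, "мир": 4,
         "hello": 5, "world": 6, "bonjour": 7, "你好": 8}
SPECIAL_IDS = {0, 1, 2}  # special tokens are always kept


def tokenize(text):
    """Toy whitespace tokenizer (assumption; the real model uses a subword tokenizer)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def find_required_ids(corpus, min_count=1):
    """Return the sorted ids of tokens seen at least `min_count` times, plus specials."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    used = {tok_id for tok_id, n in counts.items() if n >= min_count}
    return sorted(used | SPECIAL_IDS)


# Tiny stand-in for the Russian and English parts of a large corpus.
corpus = ["Привет мир", "hello world", "привет world"]
kept_ids = find_required_ids(corpus)
print(kept_ids)
```

Tokens that never occur in the scanned texts (here the French and Chinese entries) drop out of `kept_ids`, which is the idea behind shrinking the vocabulary to the tokens a target language actually needs.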

Was the model trained in any way?

No. Only the tokenizer was modified. Because this changed the token identifiers, the corresponding embeddings in the model's word_embeddings module were moved to their new positions, so the quality of this model on Cyrillic (and English) text is exactly the same as that of the original.
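
The embedding move can be pictured with a small sketch. It uses plain Python lists instead of real model tensors, and the helper name `shrink_embeddings`, the 6-token table, and the chosen `kept_ids` are hypothetical; the point is only that each kept token's vector is copied unchanged to the row given by its new identifier.

```python
def shrink_embeddings(old_embeddings, kept_ids):
    """Build the reduced embedding table and the old-id -> new-id mapping."""
    # New ids are assigned in the order of kept_ids: 0, 1, 2, ...
    old_to_new = {old_id: new_id for new_id, old_id in enumerate(kept_ids)}
    # Copy each kept row as-is; no values are changed, only row positions.
    new_embeddings = [list(old_embeddings[old_id]) for old_id in kept_ids]
    return new_embeddings, old_to_new


# Hypothetical 6-token table with 2-dimensional vectors.
old_embeddings = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1],
                  [3.0, 3.1], [4.0, 4.1], [5.0, 5.1]]
kept_ids = [0, 1, 3, 5]  # old ids of the tokens that survive

new_embeddings, old_to_new = shrink_embeddings(old_embeddings, kept_ids)

# Every kept token maps to exactly the same vector as before.
for old_id, new_id in old_to_new.items():
    assert new_embeddings[new_id] == old_embeddings[old_id]
print(old_to_new)
```

Since the vectors themselves are untouched and the rest of the network is not modified, any input made up only of kept tokens would produce the same activations as in the original model, which is why no retraining is needed.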

Why do we need this?

This lets you use significantly less memory during training and also greatly reduces the size of the model, because the embedding table has far fewer rows.
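
A back-of-the-envelope estimate shows the scale of the saving. The numbers below rest on stated assumptions: an embedding width of 1024, float32 weights, and the rounded figure of 250k original tokens; only the 69,382 count comes directly from this card.

```python
HIDDEN_SIZE = 1024        # assumption: embedding width of the large model
BYTES_PER_PARAM = 4       # assumption: float32 weights
ORIGINAL_VOCAB = 250_000  # the rounded "250k" figure, approximate
REDUCED_VOCAB = 69_382    # token count stated in this card


def embedding_megabytes(vocab_size):
    """Size of a vocab_size x HIDDEN_SIZE embedding table in megabytes (10^6 bytes)."""
    return vocab_size * HIDDEN_SIZE * BYTES_PER_PARAM / 1e6


original_mb = embedding_megabytes(ORIGINAL_VOCAB)
reduced_mb = embedding_megabytes(REDUCED_VOCAB)
saved_mb = original_mb - reduced_mb
print(f"original: {original_mb:.1f} MB, reduced: {reduced_mb:.1f} MB, saved: {saved_mb:.1f} MB")
```

Under these assumptions the embedding table shrinks by roughly 740 MB. During training the effect is larger, since gradients and optimizer states are typically stored per parameter as well.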

Authors

Sergei Bratchikov (https://t.me/nlpwanderer)