mlsa-iai-msu-lab/sci-rus-tiny

SciRus-tiny is a model for obtaining embeddings of scientific texts in Russian and English. The model was trained on eLibrary data using contrastive techniques described in a Habr post. It achieves high scores on the ruSciBench benchmark.

How to get embeddings

from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F
import torch


tokenizer = AutoTokenizer.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
model = AutoModel.from_pretrained("mlsa-iai-msu-lab/sci-rus-tiny")
# model.cuda()  # if you want to use a GPU

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def get_sentence_embedding(title, abstract, model, tokenizer, max_length=None):
    # Join title and abstract with the '</s>' separator, then tokenize
    sentence = '</s>'.join([title, abstract])
    encoded_input = tokenizer(
        [sentence], padding=True, truncation=True, return_tensors='pt', max_length=max_length).to(model.device)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Perform pooling
    sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
    # Normalize embeddings
    sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.cpu().detach().numpy()[0]

print(get_sentence_embedding('some title', 'some abstract', model, tokenizer).shape)
# (312,)
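
To make the pooling step concrete, here is a minimal pure-Python sketch of the same masked mean that `mean_pooling` computes. The helper name `masked_mean` and the toy 2-dimensional input are made up for illustration and are not model output. Positions where the attention mask is 0 (padding) do not contribute to the average.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors over positions where attention_mask == 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vector[i] * keep
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    count = max(float(sum(attention_mask)), 1e-9)
    return [value / count for value in summed]


# Hypothetical toy input: two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The large values in the padding row are ignored, which is why the attention mask has to be passed to the pooling function along with the model output.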

Alternatively, you can use the sentence_transformers library:

from sentence_transformers import SentenceTransformer


model = SentenceTransformer('mlsa-iai-msu-lab/sci-rus-tiny')
embeddings = model.encode(['some title' + '</s>' + 'some abstract'])
print(embeddings[0].shape)
# (312,)
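
One typical use of such embeddings is ranking papers by similarity to a query. The sketch below is an assumption about usage rather than part of the original card: `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the toy 3-dimensional vectors stand in for the 312-dimensional vectors produced above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for real embeddings (illustrative values only)
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print([index for index, _ in ranking])  # [1, 2, 0]
```

The helpers only iterate over their inputs, so rows of the array returned by `model.encode` can be passed in directly. For large collections a vectorized or indexed search would likely be a better fit than this loop.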

Authors

Benchmark developed by the MLSA Lab of the Institute for AI, MSU.

Acknowledgement

The research is part of the project #23-Ш05-21 SES MSU "Development of mathematical methods of machine learning for processing large-volume textual scientific information". We would like to thank eLibrary for the provided datasets.

Contacts

Nikolai Gerasimenko, Alexey Vatolin

Citation

@article{Gerasimenko2024,
  author  = {Gerasimenko, N. and Vatolin, A. and Ianina, A. and Vorontsov, K.},
  title   = {SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts},
  journal = {Doklady Mathematics},
  year    = {2024},
  volume  = {110},
  number  = {1},
  pages   = {S193--S202},
  month   = {dec},
  issn    = {1531-8362},
  doi     = {10.1134/S1064562424602178},
  url     = {https://doi.org/10.1134/S1064562424602178}
}