---
license: mit
datasets:
- IlyaGusev/gazeta
- csebuetnlp/xlsum
language:
- ru
metrics:
- bertscore
- bleu
- rouge
- chrf
- meteor
tags:
- text2text-generation
- summarization
- russian
- t5
base_model:
- ai-forever/ruT5-base
---

# ruT5-base Model for Abstractive Summarization of Russian News

This is the `ai-forever/ruT5-base` model, fine-tuned for abstractive summarization of news texts in Russian.

## Model Description

The model is based on the T5 (Text-to-Text Transfer Transformer) architecture, an encoder-decoder transformer. The original pre-trained model `ai-forever/ruT5-base` was fine-tuned on a combined dataset consisting of Russian news articles from the Gazeta dataset and the Russian part of XLSum.

Details of the training process and an analysis of the results can be found in the [GitHub repository](https://github.com/XristoLeonov/ru-text-summarization).

**Fine-tuning Parameters (key):**

* **Base model:** `ai-forever/ruT5-base`
* **Dataset:** Combined Gazeta + XLSum (Russian part), ~32k "article-summary" pairs after filtering.
* **Max input length:** 512 tokens
* **Max output length (summary):** 64 tokens

A sketch at the end of this card illustrates how these length limits are typically applied when tokenizing the training pairs.

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "Xristo/ruT5-base-rus-news-sum"

# Load the fine-tuned model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

article_text = """..."""

# Tokenize the article, truncating it to the model's maximum input length
input_ids = tokenizer(
    [article_text],
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)["input_ids"]

# Generate a summary of up to 64 tokens with beam search
output_ids = model.generate(
    input_ids=input_ids,
    max_length=64,
    no_repeat_ngram_size=3,
    num_beams=4,
    early_stopping=True,
)

summary = tokenizer.decode(
    output_ids[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)

print("Generated summary:")
print(summary)
```

## Evaluation Results (Metrics)

Evaluation was performed on a held-out test set (10% of the filtered Gazeta + XLSum dataset). The best checkpoint (epoch 20) showed the following results:

| Model     | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1 | METEOR | BERTScore F1 | chrF++ | BLEU  |
|-----------|------------|------------|------------|--------|--------------|--------|-------|
| ruT5-base | 30.73      | 15.22      | 27.94      | 29.42  | 78.36        | 40.06  | 10.91 |

**Comparison with baseline models:**

Compared to `IlyaGusev/mbart_ru_sum_gazeta` (max summary length 200 tokens; R1 = 32.4, R2 = 14.3, RL = 28.0, METEOR = 26.4) and `csebuetnlp/mT5_multilingual_XLSum` (max summary length 84 tokens; R1 = 32.2, R2 = 13.6, RL = 26.2 on Russian XLSum), this fine-tuned `ruT5-base` model, with a maximum summary length of only 64 tokens, achieves competitive results: it surpasses both baselines in ROUGE-2 and the mBART baseline in METEOR, which indicates a high information density of the generated summaries.
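For reference, the sketch below shows one way metrics of this kind can be recomputed with the Hugging Face `evaluate` library. It is a minimal illustration under stated assumptions, not the exact evaluation pipeline behind the table above: it samples a handful of examples from the public Gazeta test split (column names `text` and `summary` assumed), and it passes a whitespace tokenizer to ROUGE because the default ROUGE tokenizer strips non-Latin characters (this requires a reasonably recent version of `evaluate`).

```python
# Illustrative evaluation sketch (not the exact pipeline used for this card).
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "Xristo/ruT5-base-rus-news-sum"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).to(device)

# Small sample from the public Gazeta test split (assumed columns: "text", "summary");
# trust_remote_code=True may be needed if the dataset ships a loading script.
dataset = load_dataset("IlyaGusev/gazeta", split="test[:16]", trust_remote_code=True)

predictions, references = [], []
for example in dataset:
    inputs = tokenizer(
        example["text"], max_length=512, truncation=True, return_tensors="pt"
    ).to(device)
    output_ids = model.generate(
        **inputs,
        max_length=64,
        no_repeat_ngram_size=3,
        num_beams=4,
        early_stopping=True,
    )
    predictions.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    references.append(example["summary"])

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
chrf = evaluate.load("chrf")

# The default ROUGE tokenizer drops Cyrillic text, so use simple whitespace tokenization.
print(rouge.compute(predictions=predictions, references=references,
                    tokenizer=lambda text: text.split()))
print(bertscore.compute(predictions=predictions, references=references, lang="ru"))
# word_order=2 corresponds to chrF++.
print(chrf.compute(predictions=predictions,
                   references=[[ref] for ref in references], word_order=2))
```

Scores computed on such a small sample will naturally differ from the table above, which was obtained on the full held-out test set.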
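The maximum input and output lengths listed under Fine-tuning Parameters correspond to how article-summary pairs are typically tokenized for seq2seq training. The snippet below is only an illustrative sketch, not the exact training code (see the GitHub repository linked above); the column names `text` and `summary` are assumptions.

```python
# Illustrative preprocessing sketch for seq2seq fine-tuning (not the exact training code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruT5-base")

MAX_INPUT_LENGTH = 512   # article (encoder) side
MAX_TARGET_LENGTH = 64   # summary (decoder) side

def preprocess(batch):
    # Tokenize articles for the encoder and summaries as decoder labels.
    model_inputs = tokenizer(
        batch["text"], max_length=MAX_INPUT_LENGTH, truncation=True
    )
    labels = tokenizer(
        text_target=batch["summary"], max_length=MAX_TARGET_LENGTH, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Typical usage with a datasets.Dataset:
# tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
```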