---
library_name: transformers
license: cc-by-4.0
datasets:
- Goader/kobza
language:
- uk
pipeline_tag: fill-mask
tags: []
---
# Modern-LiBERTa
Modern-LiBERTa is a ModernBERT encoder model designed specifically for **Ukrainian**, with support for **long contexts of up to 8,192 tokens**. It was introduced in the paper *On the Path to Make Ukrainian a High-Resource Language*, presented at [UNLP](https://unlp.org.ua/) @ [ACL 2025](https://2025.aclweb.org/).
The model is pre-trained on **Kobza**, a large-scale Ukrainian corpus of nearly 60 billion tokens. Modern-LiBERTa builds on the [ModernBERT](https://arxiv.org/abs/2412.13663) architecture and is the first Ukrainian language model to support long-context encoding efficiently.
The goal of this work is to **make Ukrainian a first-class citizen in multilingual and monolingual NLP**, enabling robust performance on complex tasks that require broader context and knowledge access.
All training code and tokenizer tools are available in the [Goader/ukr-lm](https://github.com/Goader/ukr-lm) repository.
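Since long-context support is the model's distinguishing feature, below is a minimal sketch of encoding a long document (the document text is a placeholder; see the usage section further down for complete examples):

```python
from transformers import AutoTokenizer, AutoModel

# The tokenizer ships as custom code in the model repository,
# hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("Goader/modern-liberta-large", trust_remote_code=True)
model = AutoModel.from_pretrained("Goader/modern-liberta-large")

long_document = "..."  # placeholder: any Ukrainian text up to 8,192 tokens
encoded = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=8192)
output = model(**encoded)
```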
## Evaluation
| | NER-UK (Micro F1) | WikiANN (Micro F1) | UD POS (Accuracy) | News (Macro F1) |
|:------------------------------------------------------------------------------------------------------------------------|:------------------------:|:------------------:|:------------------------------:|:----------------------------------------:|
| **Base Models** | | | | |
| [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 90.86 (0.81) | 92.27 (0.09) | 98.45 (0.07) | - |
| [roberta-base-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-base-wechsel-ukrainian) | 90.81 (1.51) | 92.98 (0.12) | 98.57 (0.03) | - |
| [electra-base-ukrainian-cased-discriminator](https://huggingface.co/lang-uk/electra-base-ukrainian-cased-discriminator) | 90.43 (1.29) | 92.99 (0.11) | 98.59 (0.06) | - |
| **Large Models** | | | | |
| [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) | 90.16 (2.98) | 92.92 (0.19) | 98.71 (0.04) | 95.13 (0.49) |
| [roberta-large-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-large-wechsel-ukrainian) | 91.24 (1.16) | 93.22 (0.17) | 98.74 (0.06) | __96.48 (0.09)__ |
| [liberta-large](https://huggingface.co/Goader/liberta-large) | 91.27 (1.22) | 92.50 (0.07) | 98.62 (0.08) | 95.44 (0.04) |
| [liberta-large-v2](https://huggingface.co/Goader/liberta-large-v2) | __91.73 (1.81)__ | 93.22 (0.14) | __98.79 (0.06)__ | 95.67 (0.12) |
| [modern-liberta-large](https://huggingface.co/Goader/modern-liberta-large) | 91.66 (0.57) | __93.37 (0.16)__ | __98.78 (0.07)__ | 96.37 (0.07) |
## Fine-Tuning Hyperparameters
| Hyperparameter | Value |
|:---------------|:-----:|
| Peak Learning Rate | 3e-5 |
| Warm-up Ratio | 0.05 |
| Learning Rate Decay | Linear |
| Batch Size | 16 |
| Epochs | 10 |
| Weight Decay | 0.05 |
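For reference, a minimal sketch of how these hyperparameters map onto `transformers.TrainingArguments`; the `output_dir` is an illustrative placeholder, and the dataset, task head, and `Trainer` wiring (not shown) are not the exact setup used for the reported scores:

```python
from transformers import TrainingArguments

# Mirrors the hyperparameter table above
training_args = TrainingArguments(
    output_dir="modern-liberta-finetuned",  # placeholder
    learning_rate=3e-5,                     # peak learning rate
    lr_scheduler_type="linear",             # linear learning rate decay
    warmup_ratio=0.05,
    per_device_train_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.05,
)
```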
## How to Get Started with the Model
Use the code below to get started with the model. Note that the repository contains custom tokenization code, so pass `trust_remote_code=True` when loading:
Pipeline usage:
```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", "Goader/modern-liberta-large", trust_remote_code=True)
>>> fill_mask("Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі [MASK] яблук мамі.")
[{'score': 0.3426803946495056,
'token': 8638,
'token_str': 'шість',
'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі шість яблук мамі.'},
{'score': 0.21772164106369019,
'token': 24170,
'token_str': 'решту',
'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі решту яблук мамі.'},
{'score': 0.16074775159358978,
'token': 9947,
'token_str': 'вісім',
'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі вісім яблук мамі.'},
{'score': 0.078955739736557,
'token': 2036,
'token_str': 'сім',
'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі сім яблук мамі.'},
{'score': 0.028996430337429047,
'token': 813,
'token_str': '6',
'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі 6 яблук мамі.'}]
```
Extracting embeddings:
```python
from transformers import AutoTokenizer, AutoModel

# The tokenizer is implemented as custom code in the model repository,
# hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("Goader/modern-liberta-large", trust_remote_code=True)
model = AutoModel.from_pretrained("Goader/modern-liberta-large")

# Encode a sentence and run it through the model
encoded = tokenizer('Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі шість яблук мамі.', return_tensors='pt')
output = model(**encoded)  # output.last_hidden_state holds per-token embeddings
```
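The output contains one vector per token. If you need a single sentence-level vector, one common choice is mean pooling over non-padding tokens; the pooling strategy below is an assumption, not something this model prescribes. Continuing from the snippet above:

```python
# Mean pooling: average token embeddings, masking out padding positions
# (pooling choice is an assumption, not prescribed by the model card)
mask = encoded["attention_mask"].unsqueeze(-1)
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, hidden_size])
```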
## Licence
CC BY 4.0
## Authors
- Mykola Haltiuk, PhD Candidate @ AGH University of Krakow
- Aleksander Smywiński-Pohl, PhD @ AGH University of Krakow