---
library_name: transformers
license: cc-by-4.0
datasets:
  - Goader/kobza
language:
  - uk
pipeline_tag: fill-mask
tags: []
---

<h1 align="center">Modern-LiBERTa</h1>

<h2 align="center">On the Path to Make Ukrainian a High-Resource Language <a href="https://aclanthology.org/2025.unlp-1.14/">[paper]</a></h2>


<!-- Provide a quick summary of what the model is/does. -->
Modern-LiBERTa is a ModernBERT encoder model designed specifically for **Ukrainian**, with support for **long contexts up to 8,192 tokens**. It was introduced in the paper [On the Path to Make Ukrainian a High-Resource Language](https://aclanthology.org/2025.unlp-1.14/) presented at the [UNLP](https://unlp.org.ua/) @ [ACL 2025](https://2025.aclweb.org/).

The model is pre-trained on **Kobza** [[HF](https://huggingface.co/datasets/Goader/kobza)], a large-scale Ukrainian corpus of nearly 60 billion tokens. Modern-LiBERTa builds on the [ModernBERT](https://arxiv.org/abs/2412.13663) architecture and is the first Ukrainian transformer encoder capable of handling long contexts efficiently.

The goal of this work is to **make Ukrainian a first-class citizen in multilingual and monolingual NLP**, enabling robust performance on complex tasks that require broader context and knowledge access.

All training code and tokenizer tools are available in the [Goader/ukr-lm](https://github.com/Goader/ukr-lm) repository.
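
As a quick illustration of the long-context support, the minimal sketch below encodes a long document and caps it at the 8,192-token window (the repeated sentence is only a placeholder standing in for a real long document):

```python
from transformers import AutoTokenizer

# Minimal sketch: encode a long document up to the 8,192-token context window.
# The repeated sentence is a placeholder for a real long Ukrainian document.
tokenizer = AutoTokenizer.from_pretrained("Goader/modern-liberta-large", trust_remote_code=True)

long_document = "Київ є столицею України. " * 3000
encoded = tokenizer(long_document, truncation=True, max_length=8192, return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 8192)
```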


## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

<!-- Read the [paper](https://aclanthology.org/2024.unlp-1.14/) for more detailed tasks descriptions. -->

|                                                                                                                         | NER-UK (Micro F1)   | WikiANN (Micro F1) | UD POS (Accuracy)              | News (Macro F1) |
|:------------------------------------------------------------------------------------------------------------------------|:------------------------:|:------------------:|:------------------------------:|:----------------------------------------:|
| <tr><td colspan="5" style="text-align: center;"><strong>Base Models</strong></td></tr>
| [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)                                                  | 90.86 (0.81)             | 92.27 (0.09)       | 98.45 (0.07)                   | -                                        |
| [roberta-base-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-base-wechsel-ukrainian)                        | 90.81 (1.51)             | 92.98 (0.12)       | 98.57 (0.03)                   | -                                        |
| [electra-base-ukrainian-cased-discriminator](https://huggingface.co/lang-uk/electra-base-ukrainian-cased-discriminator) | 90.43 (1.29)             | 92.99 (0.11)       | 98.59 (0.06)                   | -                                        |
| <tr><td colspan="5" style="text-align: center;"><strong>Large Models</strong></td></tr>
| [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)                                                | 90.16 (2.98)             | 92.92 (0.19)       | 98.71 (0.04)                   | 95.13 (0.49)                             |
| [roberta-large-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-large-wechsel-ukrainian)                      | 91.24 (1.16)             | 93.22 (0.17)       | 98.74 (0.06)                   | __96.48 (0.09)__                         |
| [liberta-large](https://huggingface.co/Goader/liberta-large)                                                            | 91.27 (1.22)             | 92.50 (0.07)       | 98.62 (0.08)                   | 95.44 (0.04)                             |
| [liberta-large-v2](https://huggingface.co/Goader/liberta-large-v2)                                                      | __91.73 (1.81)__         | 93.22 (0.14)       | __98.79 (0.06)__               | 95.67 (0.12)                             |
| [modern-liberta-large](https://huggingface.co/Goader/modern-liberta-large)                                              | 91.66 (0.57)             | __93.37 (0.16)__   | __98.78 (0.07)__               | 96.37 (0.07)                             |


## Fine-Tuning Hyperparameters

| Hyperparameter | Value |
|:---------------|:-----:|
| Peak Learning Rate  | 3e-5   |
| Warm-up Ratio       | 0.05   |
| Learning Rate Decay | Linear |
| Batch Size          | 16     |
| Epochs              | 10     |
| Weight Decay        | 0.05   |
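
For reference, a minimal sketch of how these values map onto `transformers` `TrainingArguments` (the output directory is a placeholder, and the batch size is assumed to be per device):

```python
from transformers import TrainingArguments

# Fine-tuning hyperparameters from the table above mapped onto TrainingArguments.
# "modern-liberta-finetuned" is a placeholder output directory.
training_args = TrainingArguments(
    output_dir="modern-liberta-finetuned",
    learning_rate=3e-5,              # peak learning rate
    warmup_ratio=0.05,
    lr_scheduler_type="linear",      # linear learning-rate decay
    per_device_train_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.05,
)
```

These arguments can then be passed to a `Trainer` together with a task-specific head, e.g. `AutoModelForTokenClassification` for NER and POS tagging or `AutoModelForSequenceClassification` for news classification.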


## How to Get Started with the Model

Use the code below to get started with the model. Note that the repository contains custom tokenization code, so pass `trust_remote_code=True` when loading the tokenizer or the pipeline:

Pipeline usage:

```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", "Goader/modern-liberta-large", trust_remote_code=True)
>>> fill_mask("Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі <mask> яблук мамі.")
[{'score': 0.3426803946495056,
  'token': 8638,
  'token_str': 'шість',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі шість яблук мамі.'},
 {'score': 0.21772164106369019,
  'token': 24170,
  'token_str': 'решту',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі решту яблук мамі.'},
 {'score': 0.16074775159358978,
  'token': 9947,
  'token_str': 'вісім',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі вісім яблук мамі.'},
 {'score': 0.078955739736557,
  'token': 2036,
  'token_str': 'сім',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі сім яблук мамі.'},
 {'score': 0.028996430337429047,
  'token': 813,
  'token_str': '6',
  'sequence': 'Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі 6 яблук мамі.'}]
```

Extracting embeddings:

```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Goader/modern-liberta-large", trust_remote_code=True)
model = AutoModel.from_pretrained("Goader/modern-liberta-large")
encoded = tokenizer('Тарас мав чотири яблука. Марічка подарувала йому ще два. Він віддав усі шість яблук мамі.', return_tensors='pt')
output = model(**encoded)
```
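
The snippet above returns one hidden state per token. If a single sentence-level vector is needed, one common option (not prescribed by the model card) is attention-mask-aware mean pooling, continuing from the variables defined above:

```python
# Mean-pool the token embeddings, using the attention mask so that
# padding positions (if any) do not contribute to the average.
mask = encoded["attention_mask"].unsqueeze(-1).type_as(output.last_hidden_state)
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size)
```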

## Citation

```bibtex
@inproceedings{haltiuk-smywinski-pohl-2025-path,
    title = "On the Path to Make {U}krainian a High-Resource Language",
    author = "Haltiuk, Mykola  and
      Smywi{\'n}ski-Pohl, Aleksander",
    editor = "Romanyshyn, Mariana",
    booktitle = "Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria (online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.unlp-1.14/",
    pages = "120--130",
    ISBN = "979-8-89176-269-5",
    abstract = "Recent advances in multilingual language modeling have highlighted the importance of high-quality, large-scale datasets in enabling robust performance across languages. However, many low- and mid-resource languages, including Ukrainian, remain significantly underrepresented in existing pretraining corpora. We present Kobza, a large-scale Ukrainian text corpus containing nearly 60 billion tokens, aimed at improving the quality and scale of Ukrainian data available for training multilingual language models. We constructed Kobza from diverse, high-quality sources and applied rigorous deduplication to maximize data utility. Using this dataset, we pre-trained Modern-LiBERTa, the first Ukrainian transformer encoder capable of handling long contexts (up to 8192 tokens). Modern-LiBERTa achieves competitive results on various standard Ukrainian NLP benchmarks, particularly benefiting tasks that require broader contextual understanding or background knowledge. Our goal is to support future efforts to develop robust Ukrainian language models and to encourage greater inclusion of Ukrainian data in multilingual NLP research."
}
```

## License

CC BY 4.0

## Authors

Mykola Haltiuk,
PhD Candidate @ AGH University of Krakow

Aleksander Smywiński-Pohl,
PhD @ AGH University of Krakow