Announcing MamayLM, an efficient state-of-the-art Ukrainian LLM

Community Article · Published April 23, 2025

We are releasing MamayLM, the best-performing efficient Ukrainian language model, which surpasses all similarly sized models in both English and Ukrainian while matching or outperforming models up to 10x larger.

We are delighted to announce the release of MamayLM, a new state-of-the-art LLM targeting the Ukrainian language. With its 9 billion parameters, it is cost-efficient and can be run on a single GPU, yet is effective in both Ukrainian and English. The model outperforms open models of similar size in both languages, while matching or comparing favourably against much larger models. MamayLM is the result of a collaboration between researchers at INSAIT and ETH Zurich. A Ukrainian version of this blog post is available here.

MamayLM is built on top of Google’s Gemma 2 9B model, which INSAIT previously used as the basis for developing the BgGPT 2.0 series of models featured in this blog post from Google. Following a similar recipe, with improvements to continual training, merging and benchmarking, as well as the addition of synthetic data, we have forged a new model that is lightweight and practical, yet highly capable of understanding and producing Ukrainian text, while retaining and even improving its base capabilities. MamayLM is tailored to Ukrainian specifics, with expert command of the language and its cultural nuances. It provides a strong basis for building applications on top of the model, for integration in government institutions, especially in scenarios where preserving data privacy is critical (as the model can be run locally), as well as for cost-effective personal use.

Adapting Gemma 2 for Ukrainian

Prior to MamayLM, we successfully specialized the Gemma 2 family of models to Bulgarian, thanks to our research in language transfer [1], combined with the already robust multilingual capabilities of Gemma 2. Now, we have applied a similar pipeline of data curation, continual pretraining, and instruction fine-tuning, with notable improvements at several stages, to adapt Gemma 2 9B to Ukrainian using a total of 75B tokens of Ukrainian and English text.

To collect pretraining data, we draw on publicly available datasets such as FineWeb2, Malyuk, CulturaX, and Ukrainian Wikipedia, preprocessing and filtering them for cleaner data. Since all of these corpora are web-scraped, we apply exact and fuzzy deduplication to prevent overlap between them, as sketched below.
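
For illustration, here is a minimal sketch of fuzzy deduplication with MinHash LSH using the datasketch library; the shingle size and similarity threshold are illustrative choices, not the exact values used in our pipeline.

```python
# Minimal fuzzy-deduplication sketch with MinHash LSH (datasketch).
# The shingle size and similarity threshold are illustrative choices,
# not the exact values used in our pipeline.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    """Hash character 5-gram shingles of a document into a MinHash."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

def deduplicate(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a document only if no previously kept near-duplicate exists."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash_of(doc)
        if not lsh.query(m):  # no near-duplicate among kept documents
            lsh.insert(f"doc-{i}", m)
            kept.append(doc)
    return kept

corpus = [
    "Київ — столиця України, розташована на річці Дніпро.",
    "Київ — столиця України, розташована на ріці Дніпро.",  # near-duplicate
    "Зовсім інший текст про історію козацтва.",
]
for doc in deduplicate(corpus):
    print(doc)  # the near-duplicate pair likely collapses to one document
```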

During pretraining, we used best-fit packing [13] to pack sequences at the desired context length, preserving document structure and coherence with minimal disruption (see the sketch below). This approach enhances in-context learning and language reasoning capabilities. To prevent catastrophic forgetting, we include a small proportion of English-centric data, such as English Wikipedia and Smoltalk [14].
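
To make the idea concrete, below is a simplified best-fit packing sketch in the spirit of [13]: over-long documents are first split into context-sized chunks, and each chunk is then placed into the fullest bin that still fits, so documents are rarely truncated. The context length and greedy policy details are illustrative.

```python
# Simplified best-fit packing in the spirit of [13]: split over-long
# documents into context-sized chunks, then place each chunk into the
# fullest bin that still fits (best-fit decreasing), so documents are
# rarely truncated. Context length and policy details are illustrative.

def best_fit_pack(doc_lengths: list[int], ctx: int = 8192) -> list[list[int]]:
    # Split any document longer than the context into ctx-sized chunks.
    chunks = []
    for n in doc_lengths:
        while n > ctx:
            chunks.append(ctx)
            n -= ctx
        if n:
            chunks.append(n)
    bins: list[int] = []          # remaining capacity of each bin
    packed: list[list[int]] = []  # chunk lengths assigned to each bin
    for c in sorted(chunks, reverse=True):   # best-fit decreasing order
        # Pick the bin with the least remaining space that still fits c.
        best = min((i for i, r in enumerate(bins) if r >= c),
                   key=lambda i: bins[i], default=None)
        if best is None:
            bins.append(ctx - c)
            packed.append([c])
        else:
            bins[best] -= c
            packed[best].append(c)
    return packed

# Example: a 9000-token document is split, never silently truncated.
print(best_fit_pack([9000, 5000, 3000, 200], ctx=8192))
```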

For post-training, we extracted topics relevant to Ukrainian history and culture, which enabled the generation of a synthetic dataset of Ukrainian QA pairs using knowledge distillation from a larger model. We also employed our LLM-based translation pipeline to translate domain-specific data to Ukrainian, enhancing both quantity and quality in the target language.
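
As a purely hypothetical sketch of such distillation, one could prompt a larger teacher model for topic-conditioned QA pairs as below; the teacher checkpoint, prompt template, and sampling settings are assumptions for illustration, not our exact setup.

```python
# Hypothetical distillation sketch: prompt a larger teacher model for
# topic-conditioned Ukrainian QA pairs. The teacher checkpoint, prompt
# template, and sampling settings are illustrative assumptions.
from transformers import pipeline

teacher = pipeline("text-generation",
                   model="google/gemma-2-27b-it",  # assumed stand-in teacher
                   device_map="auto")

topics = ["Тарас Шевченко", "Козацька доба", "Писанкарство"]

for topic in topics:
    prompt = (f"Згенеруй одне запитання та розгорнуту відповідь "
              f"українською мовою на тему: {topic}.")
    out = teacher(prompt, max_new_tokens=300, do_sample=True, temperature=0.7)
    print(out[0]["generated_text"])
```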

Our instruction-tuning dataset incorporates various open-source datasets, such as the Nemotron SFT dataset, OpenCoder (OPC) SFT dataset, Aya Collection, and more. We acknowledge the significant contributions of the Ukrainian open-source community, particularly creators of Spivavtor, UAlpaca, UA-Squad, Ukrainian StackExchange, and UA-Lawyer QA, which amplify the potential of Ukrainian post-training.
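
A minimal sketch of assembling such a mixture with the datasets library follows; the hub IDs, config name, and column mappings are assumptions for demonstration, as our exact SFT mixture is not specified here.

```python
# Illustrative sketch of assembling an SFT mixture with the datasets
# library. The hub IDs, config name, and column mappings below are
# assumptions for demonstration; our exact mixture is not specified here.
from datasets import load_dataset, concatenate_datasets

aya = load_dataset("CohereForAI/aya_collection_language_split", "ukrainian",
                   split="train")                            # assumed config
ualpaca = load_dataset("robinhad/ua-alpaca", split="train")  # assumed hub ID

def to_pair(example, prompt_col, response_col):
    """Normalize heterogeneous schemas to a shared prompt/response pair."""
    return {"prompt": example[prompt_col], "response": example[response_col]}

sources = [
    aya.map(lambda e: to_pair(e, "inputs", "targets"),
            remove_columns=aya.column_names),
    ualpaca.map(lambda e: to_pair(e, "instruction", "output"),
                remove_columns=ualpaca.column_names),
]
mix = concatenate_datasets(sources).shuffle(seed=42)
print(mix)
```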

We also applied an advanced model merging technique inspired by Layer Swapping [11] to more precisely extract linguistic capabilities. Further, we consider findings on language imbalances and model merging [1,12], which highlight the impact of data mixing proportions on model performance.
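
For intuition, here is a minimal layer-swapping style merge in the spirit of [11]: most transformer blocks come from the language-adapted checkpoint, while a few blocks are restored from the base instruct model. The checkpoints and layer indices below are illustrative, not our actual merging recipe.

```python
# Minimal layer-swapping style merge in the spirit of [11]: keep most
# transformer blocks from the language-adapted checkpoint, but restore a
# few top and bottom blocks from the base instruct model. Checkpoints
# and layer indices are illustrative, not our actual recipe.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it", torch_dtype=torch.bfloat16)
adapted = AutoModelForCausalLM.from_pretrained(
    "path/to/ukrainian-adapted-checkpoint",  # hypothetical local checkpoint
    torch_dtype=torch.bfloat16)

merged = adapted.state_dict()
for name, tensor in base.state_dict().items():
    # Restore the base model's first and last two transformer blocks
    # (Gemma 2 9B has 42 blocks, indexed 0-41).
    if any(f"layers.{i}." in name for i in (0, 1, 40, 41)):
        merged[name] = tensor

adapted.load_state_dict(merged)
adapted.save_pretrained("merged-model")
```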

English and Ukrainian Benchmarks

We evaluated MamayLM on a set of standard English benchmarks, a translated version of them in Ukrainian, as well as Ukrainian-specific benchmarks we collected:

  • ZNO [8]: the mandatory Ukrainian national exam covering the high school curriculum in Ukrainian language, literature, mathematics and geography
  • Winogrande challenge [2]: testing commonsense reasoning about the world
  • Hellaswag [3]: testing sentence completion
  • ARC Easy/Challenge [4]: testing logical reasoning
  • TriviaQA [5]: testing trivia knowledge
  • GSM-8K [6]: solving grade-school math word problems
  • MMLU [9]: testing knowledge on a multitude of topics
  • IFEval [10]: testing instruction-following skills

We also tackled the question of how best to translate the English-only benchmarks. Although some effort has been made in this direction [7], we found it was not extensive enough and that the Ukrainian translations could be improved. We identified two key issues in benchmark translation: (i) questions and answers being separated during translation, and (ii) translation quality relying heavily on few-shot prompting or additional model-output verification. To address these issues, we developed a translation framework that preserves the joint context of questions and answers, and employs multisampling and scoring of translation candidates to balance machine-translation quality against human involvement (a simplified sketch follows below). We are releasing all benchmarks in Ukrainian as part of this release in the accompanying GitHub repository, and we will soon publish more details of our translation framework.
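
As a rough illustration of the multisampling idea only (not our full framework), one could sample several translation candidates and keep the highest-scoring one; the NLLB checkpoint is a stand-in translator and the length-based scorer is a crude placeholder for a learned quality-estimation model.

```python
# Rough illustration of multisampling translation candidates; the NLLB
# checkpoint is a stand-in translator and the length-based scorer is a
# crude placeholder for a learned quality-estimation model.
from transformers import pipeline

translator = pipeline("translation",
                      model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn", tgt_lang="ukr_Cyrl")

def translate_best(text: str, n: int = 4) -> str:
    """Sample n candidate translations and keep the highest-scoring one."""
    candidates = [
        translator(text, do_sample=True, temperature=0.7)[0]["translation_text"]
        for _ in range(n)
    ]
    # Placeholder score: prefer the candidate whose length is closest to
    # the source, standing in for a proper quality-estimation scorer.
    return min(candidates, key=lambda c: abs(len(c) - len(text)))

# Translate question and answer options together to preserve context,
# mirroring issue (i) discussed above.
print(translate_best("Question: What is the capital of Ukraine? "
                     "Options: (A) Kyiv (B) Lviv"))
```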

Evaluation on Mandatory National Ukrainian Exams

Importantly, as the figure below shows, MamayLM achieves the highest score on the ZNO (Ukrainian national high school exam) amongst similarly sized models, while also outperforming much larger models, including Gemma 2 27B, Llama 3.1 70B and Qwen 2.5 72B.

[Figure: ZNO exam scores of MamayLM versus similarly sized and larger models]

Evaluation Against Similarly Sized Models

As illustrated by the figures below, MamayLM outperforms all similarly sized models (up to 13B) on all benchmarks, in both English and Ukrainian, thanks to the training method described above.

[Figures: benchmark results against similarly sized models, in English and in Ukrainian]

Benchmark Evaluation Against Larger Models

We also evaluated MamayLM against current state-of-the-art LLMs. Impressively, our model outperforms models up to 8 times larger across various benchmarks, including those specific to Ukrainian contexts, as shown in the figure below.

[Figure: benchmark comparison of MamayLM against larger models]

Generative Performance vs. Larger Models

In addition to benchmarks, we evaluated MamayLM’s generative performance on 500 complex questions. The results show that our model significantly surpasses much larger models in both the linguistic quality of the generated Ukrainian text and its content. To avoid biases and obtain the best possible judgments, we use Gemini 2.0 Flash as the judge, as it excels at Ukrainian and understands its cultural and linguistic peculiarities.

We also evaluate model performance on Ukrainian QA data, where our model compares favourably against much larger models as well as GPT-4o-mini.

[Figure: win rates on Ukrainian QA against larger models and GPT-4o-mini]
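
For reference, an LLM-as-judge setup of this kind could look like the following sketch using the google-generativeai SDK; the prompt and A/B output format are illustrative, not our exact evaluation rubric.

```python
# Hypothetical LLM-as-judge sketch with Gemini 2.0 Flash via the
# google-generativeai SDK; the prompt and A/B output format are
# illustrative, not our exact evaluation rubric.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
judge = genai.GenerativeModel("gemini-2.0-flash")

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge which answer is better in content and language."""
    prompt = (
        "Ти суддя якості українських відповідей. Оціни, яка відповідь "
        "краща за змістом і мовою, та поверни лише 'A' або 'B'.\n\n"
        f"Питання: {question}\nВідповідь A: {answer_a}\nВідповідь B: {answer_b}"
    )
    return judge.generate_content(prompt).text.strip()

print(judge_pair("Хто написав «Кобзар»?", "Тарас Шевченко.", "Іван Франко."))
```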

Benefits of MamayLM

In the current technological landscape, the need for fast, adaptable, and locally optimized solutions has become critical. As a 9B model, MamayLM is relatively compact and consistently outperforms models up to 10x larger in both English and Ukrainian. Its ability to operate on a single GPU allows for faster adaptation, lower operational costs, and simpler deployment, making it particularly well-suited for environments with limited resources and evolving demands.

This offers significant advantages for Ukrainian local businesses and government institutions, which can integrate advanced AI technologies without the prohibitive costs or complex technical requirements typically associated with larger systems. Additionally, the model’s bilingual capabilities support its application in sectors such as education and healthcare, where addressing language barriers can have a meaningful impact. In particular, it helps meet immediate needs in Ukraine by enhancing service delivery across critical areas.

Download Models and Benchmarks

We make standard and quantized versions of MamayLM available on HuggingFace, alongside a detailed description of how to use them for inference; a minimal example is sketched below.
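
For a quick start, the following is a minimal inference sketch with transformers; the repo ID is a placeholder, so use the exact model ID from our HuggingFace page.

```python
# Minimal inference sketch with transformers. The repo ID is a
# placeholder -- use the exact model ID from our HuggingFace page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "INSAIT-Institute/MamayLM-..."  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user",
             "content": "Розкажи коротко про Тараса Шевченка."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```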

If you use our models, please consider citing our work:

@misc{MamayLM,
  author = {Yukhymenko, Hanna and Alexandrov, Anton and Vechev, Martin},
  title = {MamayLM: An efficient state-of-the-art Ukrainian LLM},
  year = {2025},
  publisher = {INSAIT},
  howpublished = {https://huggingface.co/blog/INSAIT-Institute/mamaylm}
}

More on INSAIT

INSAIT is a world-class computer science and AI research institute, which is part of Sofia University, located in Sofia, Bulgaria. INSAIT was created in 2022, in partnership with Switzerland's ETH Zurich and EPFL. It is a strategic institution for Bulgaria, funded with an initial endowment of around 100M USD by the Bulgarian government, over a period of 10 years, and is generously supported with donations of roughly 15M USD from SiteGround, Google, AWS, VMware and other big-tech companies. INSAIT is the first center of its kind in Eastern Europe, structured according to top Western computer science and AI institutions – it provides world-class packages and conditions for outstanding tenure-track and tenured faculty, research scientists, post-docs, PhDs and many other positions. Currently, INSAIT hosts researchers from more than 23 nationalities and does research in areas spanning foundational models, safe and secure AI, robotics, computer vision, quantum computing, algorithms, information security, and other key areas.

Contact us

For any questions on MamayLM, please contact us at [email protected].

References

  • [1] Mitigating Catastrophic Forgetting in Language Transfer via Model Merging, Anton Alexandrov, Veselin Raychev, Mark Niklas Mueller, Ce Zhang, Martin Vechev, Kristina Toutanova. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 17167–17186, Miami, Florida, USA. Association for Computational Linguistics. https://aclanthology.org/2024.findings-emnlp.1000
  • [2] Winogrande: An adversarial winograd schema challenge at scale, Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Communications of the ACM, 64(9):99–106, 2021.
  • [3] HellaSwag: Can a Machine Really Finish Your Sentence?, Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. https://arxiv.org/abs/1905.07830
  • [4] Think you have solved question answering? try arc, the ai2 reasoning challenge, Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. https://arxiv.org/abs/1803.05457
  • [5] Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. https://arxiv.org/abs/1705.03551
  • [6] Training verifiers to solve math word problems, Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. https://arxiv.org/abs/2110.14168
  • [7] Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation. Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, Sara Hooker https://arxiv.org/abs/2412.03304
  • [8] ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian. Mykyta Syromiatnikov, Victoria Ruvinskaya, Anastasiya Troynina. https://arxiv.org/abs/2501.06715
  • [9] Measuring Massive Multitask Language Understanding. Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt. In International Conference on Learning Representations, 2021, https://openreview.net/pdf?id=d7KBjmI3GmQ
  • [10] Instruction-Following Evaluation for Large Language Models. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou. https://arxiv.org/abs/2311.07911
  • [11] Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models. Lucas Bandarkar, Benjamin Muller, Pritish Yuvraj, Rui Hou, Nayan Singhal, Hongjiang Lv, and Bing Liu. The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=vQhn4wrQ6j
  • [12] The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments. Anton Schäfer, Shauli Ravfogel, Thomas Hofmann, Tiago Pimentel, Imanol Schlag. https://arxiv.org/abs/2404.07982
  • [13] Fewer Truncations Improve Language Modeling. Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, and Stefano Soatto. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vol. 235, Article 439, pages 11030–11048. JMLR.org.
  • [14] SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model. Loubna Ben Allal and Anton Lozhkov and Elie Bakouch and Gabriel Martín Blázquez and Guilherme Penedo and Lewis Tunstall and Andrés Marafioti and Hynek Kydlíček and Agustín Piqueres Lajarín and Vaibhav Srivastav and Joshua Lochner and Caleb Fahlgren and Xuan-Son Nguyen and Clémentine Fourrier and Ben Burtenshaw and Hugo Larcher and Haojun Zhao and Cyril Zakka and Mathieu Morlon and Colin Raffel and Leandro von Werra and Thomas Wolf. https://arxiv.org/abs/2502.02737
