mBERTu

A Maltese multilingual model pre-trained on the Korpus Malti v4.0 using multilingual BERT as the initial checkpoint.
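
The following is a minimal sketch of querying the model for masked-token predictions. It assumes the Hugging Face transformers library is installed, that the checkpoint is available on the Hub under the identifier MLRS/mBERTu used in this card, and that it ships with a masked-language-modelling head. The function name top_predictions and its parameters are illustrative and not part of the model's documentation.

```python
MODEL_ID = "MLRS/mBERTu"  # Hub identifier as used in this card (assumed to be downloadable)


def top_predictions(text_with_mask: str, k: int = 5):
    """Return the k most likely fillers for the single masked token in the text.

    Hypothetical helper for illustration only. The text must contain the
    tokenizer's mask token exactly once.
    """
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    # Downloads the checkpoint on first use, so network access is required.
    fill = pipeline("fill-mask", model=MODEL_ID)
    predictions = fill(text_with_mask, top_k=k)
    # Each prediction is a dict holding the predicted token and its probability.
    return [(p["token_str"], p["score"]) for p in predictions]
```

To try it, pass a Maltese sentence in which one word is replaced by the tokenizer's mask token (for BERT-style vocabularies this is typically [MASK]). For downstream tasks such as those listed under the evaluation results, the checkpoint would normally be fine-tuned rather than used through this pipeline.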

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at https://mlrs.research.um.edu.mt/.

Citation

This work was first presented in Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese. Cite it as follows:

@inproceedings{BERTu,
    title = "Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and {BERT} Models for {M}altese",
    author = "Micallef, Kurt  and
              Gatt, Albert  and
              Tanti, Marc  and
              van der Plas, Lonneke  and
              Borg, Claudia",
    booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
    month = jul,
    year = "2022",
    address = "Hybrid",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.deeplo-1.10",
    doi = "10.18653/v1/2022.deeplo-1.10",
    pages = "90--101",
}

Dataset used to train MLRS/mBERTu

Korpus Malti v4.0

Evaluation results

Unlabelled Attachment Score on Maltese Universal Dependencies Treebank (MUDT): 92.100 (self-reported)
Labelled Attachment Score on Maltese Universal Dependencies Treebank (MUDT): 87.870 (self-reported)
UPOS Accuracy on MLRS POS dataset: 98.660 (self-reported)
XPOS Accuracy on MLRS POS dataset: 98.580 (self-reported)
Span-based F1 on WikiAnn (Maltese): 86.600 (self-reported)
Macro-averaged F1 on Maltese Sentiment Analysis Dataset: 76.790 (self-reported)
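
For readers unfamiliar with the metrics above, the sketch below shows the standard textbook definitions of unlabelled and labelled attachment scores and of macro-averaged F1, expressed as percentages. It is not the evaluation code behind the reported numbers, which is not included in this card; details such as punctuation handling or the label set used for averaging may differ. The function names and the toy inputs are hypothetical.

```python
from typing import Sequence, Tuple


def attachment_scores(
    gold: Sequence[Tuple[int, str]], pred: Sequence[Tuple[int, str]]
) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent.

    Each item is (head_index, dependency_label) for one token.
    UAS counts tokens with the correct head; LAS also requires the correct label.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Return the unweighted mean of per-class F1 scores, in percent.

    Assumption: classes are the union of labels seen in gold and pred.
    """
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores)


# Toy data, invented purely for illustration.
toy_gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
toy_pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (2, "amod")]
uas, las = attachment_scores(toy_gold, toy_pred)
```

On the toy parse, three of the four heads are correct and two of those also carry the correct label, so the labelled score can never exceed the unlabelled one. The same relationship holds for the MUDT figures reported above.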
