EstBERT

What's this?

EstBERT is a pretrained BERT_Base model trained exclusively on a cased Estonian corpus, with sequence lengths of both 128 and 512.

How to use?

You can use the model with the transformers library as follows.

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the EstBERT tokenizer and the model with its masked-language-modelling head.
tokenizer = AutoTokenizer.from_pretrained("tartuNLP/EstBERT")
model = AutoModelForMaskedLM.from_pretrained("tartuNLP/EstBERT")

You can also download the pretrained models directly: EstBERT_128 and EstBERT_512.
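As a quick way to try the loaded checkpoint, here is a minimal sketch of masked-token prediction using the `fill-mask` pipeline from transformers. The helper names (`format_predictions`, `predict_masked`, `main`) and the example sentence are illustrative assumptions, not part of the EstBERT release.

```python
def format_predictions(predictions):
    """Render fill-mask pipeline results as 'token<TAB>score' lines."""
    return [f"{p['token_str']}\t{p['score']:.4f}" for p in predictions]


def predict_masked(template, top_k=5, model_name="tartuNLP/EstBERT"):
    """Fill the `{mask}` placeholder in `template` and return the top_k candidates."""
    # Imported lazily so the formatting helper works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    # Use the tokenizer's own mask token rather than hard-coding "[MASK]".
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return fill_mask(text, top_k=top_k)


def main():
    # Hypothetical example sentence; downloads the model weights on first use.
    for line in format_predictions(predict_masked("Tallinn on Eesti {mask}.")):
        print(line)
```

Calling `main()` prints one candidate token per line with its score. Taking the mask token from the tokenizer keeps the sketch independent of the exact mask string the vocabulary uses.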

Dataset used to train the model

EstBERT was trained on data with sequence lengths of both 128 and 512. For training we used the Estonian National Corpus 2017, which was the largest Estonian-language corpus available at the time. It consists of four sub-corpora: the Estonian Reference Corpus 1990-2008, the Estonian Web Corpus 2013, the Estonian Web Corpus 2017, and the Estonian Wikipedia Corpus 2017.
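Because the model was trained at fixed sequence lengths, it can help to truncate and pad inputs to a matching length. The following is a minimal sketch, assuming a tokenizer loaded with `AutoTokenizer.from_pretrained`; the helper name `encode_for_length` and the constant `MAX_POSITIONS` are hypothetical, and the 512 upper bound reflects the largest sequence length mentioned above.

```python
MAX_POSITIONS = 512  # longest sequence length EstBERT was trained on


def encode_for_length(tokenizer, text, max_length=128):
    """Tokenize `text`, truncating or padding it to exactly `max_length` tokens."""
    if not 0 < max_length <= MAX_POSITIONS:
        raise ValueError(f"max_length must be between 1 and {MAX_POSITIONS}")
    # `truncation`, `padding` and `max_length` are standard tokenizer call arguments.
    return tokenizer(
        text,
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
```

With a real tokenizer, `encode_for_length(tokenizer, "Tere, maailm!", 128)` returns an encoding whose `input_ids` has 128 entries.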

Reference to cite

Tanvir et al. 2021

Why would I use it?

Overall, EstBERT performs better than mBERT and XLM-RoBERTa on part-of-speech (POS) tagging, named entity recognition (NER), rubric classification, and sentiment classification. The comparative results are given below; subscripts denote the sequence length.

Model	UPOS₁₂₈	XPOS₁₂₈	Morph₁₂₈	UPOS₅₁₂	XPOS₅₁₂	Morph₅₁₂
EstBERT	*97.89*	98.40	96.93	97.84	*98.43*	*96.80*
mBERT	97.42	98.06	96.24	97.43	98.13	96.13
XLM-RoBERTa	97.78	98.36	96.53	97.80	98.40	96.69

Model	Rubric₁₂₈	Sentiment₁₂₈	Rubric₅₁₂	Sentiment₅₁₂
EstBERT	*81.70*	74.36	80.96	74.50
mBERT	75.67	70.23	74.94	69.52
XLM-RoBERTa	80.34	74.50	78.62	*76.07*

Model	Precision₁₂₈	Recall₁₂₈	F1-Score₁₂₈	Precision₅₁₂	Recall₅₁₂	F1-Score₅₁₂
EstBERT	88.42	90.38	*89.39*	88.35	89.74	89.04
mBERT	85.88	87.09	86.51	*88.47*	88.28	88.37
XLM-RoBERTa	87.55	*91.19*	89.34	87.50	90.76	89.10
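The scores above are for downstream tasks. Assuming the usual approach of fine-tuning the pretrained model with a task-specific head (an assumption here, not something this README spells out), a minimal sketch of loading EstBERT for token classification such as NER could look like this. The label set `NER_LABELS` and the helper names are hypothetical and should be replaced with those of your own dataset.

```python
# Hypothetical BIO label set; use the labels of your own NER dataset.
NER_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]


def build_label_maps(labels):
    """Return (label2id, id2label) dictionaries for a list of label names."""
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


def load_token_classifier(labels, model_name="tartuNLP/EstBERT"):
    """Load EstBERT with a freshly initialised token-classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForTokenClassification

    label2id, id2label = build_label_maps(labels)
    return AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
```

`load_token_classifier(NER_LABELS)` returns a model whose classification head is untrained, so it still needs fine-tuning on labelled data before its predictions are meaningful.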

BibTeX entry and citation info

@misc{tanvir2020estbert,
      title={EstBERT: A Pretrained Language-Specific BERT for Estonian}, 
      author={Hasan Tanvir and Claudia Kittask and Kairit Sirts},
      year={2020},
      eprint={2011.04784},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}