HistNERo: Historical Named Entity Recognition for the Romanian Language
Abstract
This work introduces HistNERo, the first Romanian corpus for Named Entity Recognition (NER) in historical newspapers. The dataset contains 323k tokens of text, spanning from the early 19th century (1817) to the late 20th century (1990). Eight native Romanian speakers annotated the dataset with five named entity types. Each sample belongs to one of four historical regions of Romania: Bessarabia, Moldavia, Transylvania, and Wallachia. We used the dataset to run several NER experiments with Romanian pre-trained language models. Our results show that the best model achieved a strict F1-score of 55.69%. By reducing the discrepancies between regions through a novel domain adaptation technique, we improved the performance on this corpus to a strict F1-score of 66.80%, an absolute gain of more than 10 percentage points.
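For readers unfamiliar with the metric, a strict F1-score is commonly understood as entity-level F1 in which a predicted entity counts as correct only if both its boundaries and its type match the gold annotation exactly. The following is a minimal sketch of that idea, assuming BIO-style tags; it is not the authors' evaluation code, and the label names (PER, LOC, ORG) and function names are illustrative assumptions rather than the corpus's actual label set.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed tagging scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def strict_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Entity-level F1 where a match requires identical boundaries and label."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = extract_spans(g_tags), extract_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one entity matches exactly, one has the wrong type.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(strict_f1(gold, pred))
```

Under this definition a prediction that covers only part of an entity, or assigns it the wrong type, earns no credit, which is why strict scores tend to be lower than relaxed (partial-match) ones.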
Community
Hey @avramandrei , very interesting paper and great new resource for Romanian!
I am definitely extending my hmBench for this dataset!
By the way, do you know of any additional publicly available corpora for Historical Romanian? I am thinking of "Public Domain" datasets like the ones @Pclanglais and team are collecting, e.g. the Common Corpus collection.
If there are corpora available, I would love to extend my historical multilingual language models with them :)
@stefan-it Thank you! The RODICA dataset (from which the documents used in this dataset were collected) is the only Historical Romanian corpus that I know of. Hope it helps you! :)