hmBERT 64k

non-profit

AI & ML interests

Pretraining Historical Multilingual Language Models

Historical Multilingual Language Models for Named Entity Recognition. The following languages are covered by hmBERT:

  • English (British Library Corpus - Books)
  • German (Europeana Newspaper)
  • French (Europeana Newspaper)
  • Finnish (Europeana Newspaper)
  • Swedish (Europeana Newspaper)

More details can be found in our GitHub repository and in our hmBERT paper.

The hmBERT 64k model is a 12-layer BERT model with a vocabulary of 64k entries.
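To give a rough sense of what a 64k vocabulary means for model size, the sketch below estimates the number of token-embedding weights. The hidden size of 768 (the usual BERT-base value) and the reading of "64k" as exactly 64,000 entries are assumptions, not figures stated in this description. `embedding_params` is a hypothetical helper, and the 32k vocabulary appears only as a point of comparison.

```python
# Assumed values: neither is stated in the model description.
HIDDEN_SIZE = 768    # assumption: BERT-base hidden size
VOCAB_64K = 64_000   # assumption: "64k" read as 64,000 entries
VOCAB_32K = 32_000   # smaller vocabulary, for comparison only


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Number of weights in the token-embedding matrix (vocab x hidden)."""
    return vocab_size * hidden_size


params_64k = embedding_params(VOCAB_64K)
params_32k = embedding_params(VOCAB_32K)

print(f"64k vocab: {params_64k:,} embedding weights")
print(f"32k vocab: {params_32k:,} embedding weights")
```

Under these assumptions, doubling the vocabulary doubles the embedding matrix, while the 12 transformer layers themselves are unaffected by vocabulary size.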

Leaderboard

We evaluate our pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana. The following table gives an overview of the datasets used:

Language   Datasets
English    AjMC - TopRes19th
German     AjMC - NewsEye - HIPE-2020
French     AjMC - ICDAR-Europeana - LeTemps - NewsEye - HIPE-2020
Finnish    NewsEye
Swedish    NewsEye
Dutch      ICDAR-Europeana
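
For scripting evaluation runs, the overview above could be held in a plain mapping from language to dataset names. This is only a minimal sketch: the names `DATASETS` and `evaluation_pairs` are hypothetical and not taken from the hmBERT code base.

```python
# Language -> datasets, copied from the overview table above.
DATASETS = {
    "English": ["AjMC", "TopRes19th"],
    "German": ["AjMC", "NewsEye", "HIPE-2020"],
    "French": ["AjMC", "ICDAR-Europeana", "LeTemps", "NewsEye", "HIPE-2020"],
    "Finnish": ["NewsEye"],
    "Swedish": ["NewsEye"],
    "Dutch": ["ICDAR-Europeana"],
}


def evaluation_pairs(datasets):
    """Return every (language, dataset) combination in table order."""
    return [(lang, name) for lang, names in datasets.items() for name in names]


pairs = evaluation_pairs(DATASETS)
for language, dataset in pairs:
    print(f"{language}: {dataset}")
```

Keeping the table as data makes it straightforward to loop over all language and dataset combinations when launching fine-tuning or evaluation jobs.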

All results can be found in the hmLeaderboard.

Acknowledgements

We thank Luisa März, Katharina Schmid and Erion Çano for fruitful discussions about historical language models.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️
