Journaux-LM

Journaux-LM is a language model pretrained on historical French newspapers. Technically, it is an ELECTRA model (135M parameters) that was pretrained with the TEAMS approach.
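
For orientation, here is a minimal usage sketch with the Hugging Face transformers library. The repository id PleIAs/journaux-lm-v1 is the one shown on this page, but whether the checkpoint loads through the Auto* classes is an assumption, not something confirmed by this card:

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical usage sketch. The repo id below is taken from this page;
# loading via the Auto* classes assumes a transformers-compatible export.
model_id = "PleIAs/journaux-lm-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # ELECTRA discriminator as encoder

# Encode a short (historical) French sentence and inspect the hidden states.
inputs = tokenizer("Le journal annonce la nouvelle ce matin.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```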

Datasets

Version 1 of the Journaux-LM was pretrained on publicly available datasets. In total, the pretraining corpus has a size of 408 GB.

Benchmarks (Named Entity Recognition)

We compare our Journaux-LM directly to the French Europeana BERT model (Journaux-LM is intended as its successor) on various downstream tasks from the hmBench repository, which is focused on Named Entity Recognition.

We report the micro F1-score averaged over 5 runs with different seeds, and use the hyper-parameter configuration that performs best on the development set of each dataset to report the final test score.
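
As an illustration, here is a minimal sketch of this selection protocol. It is not the authors' evaluation code; the hyper-parameter names and all scores are placeholders:

```python
from statistics import mean

# Placeholder results: micro F1 per hyper-parameter configuration and seed,
# as produced by 5 fine-tuning runs with different seeds on one dataset.
dev_f1 = {
    "lr=3e-5,bs=16": [86.1, 86.4, 86.2, 86.3, 86.2],
    "lr=5e-5,bs=32": [85.8, 86.0, 85.9, 86.1, 85.7],
}
test_f1 = {
    "lr=3e-5,bs=16": [83.3, 83.5, 83.4, 83.5, 83.3],
    "lr=5e-5,bs=32": [83.0, 83.1, 82.9, 83.2, 83.0],
}

# Select the configuration with the best averaged development F1 ...
best = max(dev_f1, key=lambda cfg: mean(dev_f1[cfg]))
# ... and report the averaged test F1 of exactly that configuration.
print(f"{best}: dev F1 = {mean(dev_f1[best]):.2f}, test F1 = {mean(test_f1[best]):.2f}")
```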

Development Set

The results on the development set can be seen in the following table:

| Model \ Dataset | AjMC  | ICDAR | LeTemps | NewsEye | HIPE-2020 | Avg.  |
|-----------------|-------|-------|---------|---------|-----------|-------|
| Europeana BERT  | 85.70 | 77.63 | 67.14   | 82.68   | 85.98     | 79.83 |
| Journaux-LM v1  | 86.25 | 78.51 | 67.76   | 84.07   | 88.17     | 80.95 |

On the development sets, our Journaux-LM improves over the French Europeana BERT model by 1.12 percentage points on average.

Test Set

The final results on the test set can be seen here:

| Model \ Dataset | AjMC  | ICDAR | LeTemps | NewsEye | HIPE-2020 | Avg.  |
|-----------------|-------|-------|---------|---------|-----------|-------|
| Europeana BERT  | 81.06 | 78.17 | 67.22   | 73.51   | 81.00     | 76.19 |
| Journaux-LM v1  | 83.41 | 77.73 | 67.11   | 74.48   | 83.14     | 77.17 |

On the test sets, our Journaux-LM outperforms the French Europeana BERT model by 0.98 percentage points on average.
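
The averaged scores and both deltas can be reproduced from the per-dataset numbers; a small sanity check, with all values copied from the two tables above:

```python
scores = {
    "dev": {
        "Europeana BERT": [85.70, 77.63, 67.14, 82.68, 85.98],
        "Journaux-LM v1": [86.25, 78.51, 67.76, 84.07, 88.17],
    },
    "test": {
        "Europeana BERT": [81.06, 78.17, 67.22, 73.51, 81.00],
        "Journaux-LM v1": [83.41, 77.73, 67.11, 74.48, 83.14],
    },
}

for split, per_model in scores.items():
    # Averages rounded to two decimals, as in the tables above.
    avg = {m: round(sum(s) / len(s), 2) for m, s in per_model.items()}
    delta = round(avg["Journaux-LM v1"] - avg["Europeana BERT"], 2)
    print(f"{split}: {avg} -> +{delta}")
# dev:  80.95 vs. 79.83 -> +1.12
# test: 77.17 vs. 76.19 -> +0.98
```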

Changelog

  • 02.11.2024: Initial version of the model. More details are coming very soon!

Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs! ❤️

Made from Bavarian Oberland with ❤️ and 🥨.
