Model Card for DA-BERT_Old_News_V2

DA-BERT_Old_News_V2 is the second version of a transformer model trained on historical Danish texts from the period of Danish absolutism (1660-1849). It was created by researchers at Aalborg University. The aim is a domain-specific model that captures meaning in texts far enough removed in time that they no longer read like contemporary Danish.

Model Details

A BERT model pretrained on a masked language modeling (MLM) task. Training data: ENO (Enevældens Nyheder Online), a corpus of news articles, announcements and advertisements from Danish and Norwegian newspapers from 1762 to 1848. The model was trained on a subset consisting of about 360 million words. The data was created using a tailored Transkribus PyLaia model and has a word-level error rate of around 5%.

Model Description

Architecture: BERT

Pretraining Objective: Masked Language Modeling (MLM)

Sequence Length: 512 tokens

Tokenizer: Custom WordPiece tokenizer

  • Developed by: Matias Appel, CALDISS Aalborg University
  • Shared by: Johan Heinsen, Aalborg University
  • Model type: Pre-Trained Fill-mask BERT
  • Language(s) (NLP): Historical Danish
  • License: MIT

Model Sources [optional]

  • Repository: [More Information Needed]
  • Paper: In progress.
  • Demo:

Uses

This model is designed for:

  • Domain-specific masked token prediction
  • Embedding extraction for semantic search
  • Further fine-tuning

Further fine-tuning is needed to address specific use cases.

Plans for retraining on more data, and for annotated data to support fine-tuning, are still in the works. These models serve as baselines for fine-tuned models that address specific needs.

The model is mostly intended for research purposes in the historical domain, although its use is not limited to history.

The model can also serve as a baseline for further fine-tuning a historical BERT-based language model for Danish or other Scandinavian languages for textual or literary purposes.

Direct Use

  • This model can be used out-of-the-box for domain-specific masked token prediction.
  • The model can also be used for basic mean-pooled embeddings on similar data. Results may vary, as the model was only trained on the MLM task using the Transformers Trainer framework; a sketch is shown below.
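
A minimal sketch of mean-pooled embedding extraction, assuming the checkpoint id CALDISS-AAU/DA-BERT_Old_News_V2 as listed on the Hub; the example sentences are purely illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "CALDISS-AAU/DA-BERT_Old_News_V2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def mean_pooled(texts):
    """Return one mean-pooled sentence embedding per input text, ignoring padding."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    mask = enc["attention_mask"].unsqueeze(-1).float()          # (batch, seq, 1)
    return (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # (batch, hidden)

emb = mean_pooled(["Kongen udstedte en ny Forordning.", "Et Skib forliste ved Skagen."])
print(emb.shape)
```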

Downstream Use

  • The model should serve as a baseline for language modelling in a Danish or Scandinavian historical context.
  • Further fine-tuning is needed for better downstream use; a hypothetical starting point is sketched below.
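
A hypothetical starting point for such fine-tuning, loading the pretrained encoder with a fresh classification head (the label count is purely illustrative):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "CALDISS-AAU/DA-BERT_Old_News_V2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The MLM head is discarded and a randomly initialised classification head is added;
# the resulting model must then be fine-tuned on a labelled historical-text dataset.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)
```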

Out-of-Scope Use

As the model is trained on the ENO dataset, it is not suited for modern Danish texts because of its inherently historical training data.

Bias, Risks, and Limitations

The model is heavily limited to the historical period its training data comes from. Performance on masked token prediction will vary when the model is applied to modern Danish or to other Scandinavian languages; further fine-tuning is therefore needed for such uses.

The training data comes from newspapers, so a bias towards this type of material, and therefore a particular manner of writing, is inherent to the model. Newspaper prose is defined by highly literal language, so performance will also vary on material defined by figurative language.

Smaller biases and risks also exist in the model due to errors from the creation of the corpus. As mentioned, there is an approximately 5% word-level error rate, which carries over into the pretrained model. Further work on addressing these biases and risks is planned. The models in this series are also used to address errors in the collected data material, further reducing the biases in the data.

Recommendations

The model is based on historical texts that express a range of antiquated worldviews. These include racist, anti-democratic and patriarchal sentiments. This makes it utterly unfit for many use cases. It can, however, be used to examine such biases in Danish history.

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

How to Get Started with the Model
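
The model can be loaded directly with the Hugging Face transformers library. A minimal fill-mask example; the Danish sentence is purely illustrative:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="CALDISS-AAU/DA-BERT_Old_News_V2")

# Use the tokenizer's own mask token so the example does not depend on the exact special-token string.
mask = fill_mask.tokenizer.mask_token
for pred in fill_mask(f"Skibet ankom til {mask} med en Ladning Korn."):
    print(pred["token_str"], round(pred["score"], 3))
```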

Training Details

The model was trained using the Hugging Face Trainer API with the same framework as V1. Training started on the LUMI HPC system using the small-GPU nodes for the first epochs; further training was conducted on the DeiC UCloud infrastructure. The MLM masking probability was set to 0.15.
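
A sketch of what such a Trainer-based MLM setup could look like; only the masking probability (0.15) is documented above, so the batch size, epoch count and the tiny in-memory dataset are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "CALDISS-AAU/DA-BERT_Old_News_V2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Tiny in-memory dataset standing in for the tokenised ENO subset.
raw = Dataset.from_dict({"text": ["Skibet ankom til Havnen.", "Kongen udstedte en Forordning."]})
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                    remove_columns=["text"])

# Dynamic masking with the documented masking probability of 0.15.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="da-bert-old-news-mlm",
                         per_device_train_batch_size=8, num_train_epochs=1)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```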

Training Data

Training data consisted of a 90% split from the ENO-dataset (https://huggingface.co/datasets/JohanHeinsen/ENO).
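A sketch of reproducing such a split with the datasets library; the split name and seed are assumptions, and only the 90% proportion is documented above:

```python
from datasets import load_dataset

eno = load_dataset("JohanHeinsen/ENO", split="train")  # assumes a single "train" split
splits = eno.train_test_split(test_size=0.1, seed=42)  # seed is a placeholder
train_ds, held_out = splits["train"], splits["test"]
print(len(train_ds), len(held_out))
```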

Training Procedure

Texts shorter than 35 characters were removed. Texts containing more than a predetermined amount of German, Latin or rare words were removed. Extra whitespace was also removed. A harder segmentation of the news articles was conducted for this version of the dataset to accommodate bias in the data and, hopefully, to enhance the model's learning and lessen the bias stemming from texts being mashed together. This resulted in more data rows and better text quality.
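
A sketch of the two simpler filters described above (minimum length and whitespace normalisation); the German/Latin/rare-word filter and the article segmentation are not reproduced here:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def keep_text(text: str, min_chars: int = 35) -> bool:
    """Drop very short texts, per the 35-character threshold stated above."""
    return len(text) >= min_chars

examples = ["  Til Salgs:  en god og stærk Hest,  Henvendelse i Gaarden.  ", "Avis."]
cleaned = [normalize_whitespace(t) for t in examples if keep_text(normalize_whitespace(t))]
print(cleaned)
```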

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

BERT-base architecture (approx. 110M parameters, F32 weights) with a masked language modeling objective and a 512-token sequence length.

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

Model Card Contact

[More Information Needed]
