Model Card for DA-Bert_Old_News_V1

DA-Bert_Old_News_V1 is the first version of a transformer trained on Danish historical texts from the period of Danish Absolutism (1660-1849). It was created by researchers at Aalborg University. The aim is a domain-specific model that captures meaning in texts far enough removed in time that they no longer read like contemporary Danish.

Model Details

A BERT model pretrained on a masked language modeling (MLM) task. Training data: ENO (Enevældens Nyheder Online), a corpus of news articles, announcements and advertisements from Danish and Norwegian newspapers from the period 1762 to 1848. The model was trained on a subset of about 260 million words. The text was produced with a tailored Transkribus PyLaia model and has a word-level error rate of around 5%.

Model Description

Architecture: BERT

Pretraining Objective: Masked Language Modeling (MLM)

Sequence Length: 512 tokens

Tokenizer: Custom WordPiece tokenizer

Developed by: CALDISS
Shared by: JohanHeinsen
Model type: BERT
Language(s) (NLP): Danish
License: MIT

Model Sources [optional]

Repository: [More Information Needed]
Paper [optional]: [More Information Needed]
Demo [optional]: [More Information Needed]

Uses

This model is designed for:

Domain-specific masked token prediction
Embedding extraction for semantic search
Further fine-tuning
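
For the embedding-extraction use case, one common baseline is to mean-pool the final hidden states over non-padding tokens and compare texts by cosine similarity. The sketch below assumes that approach; the card does not prescribe a pooling method, and a model trained only with MLM may need fine-tuning before its embeddings work well for search. The `model_id` argument is a placeholder for wherever the checkpoint is hosted, and the function names are illustrative.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1 (skips padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def embed(texts, model_id):
    """Return one mean-pooled vector per text.

    `model_id` is a placeholder: pass the hub ID or local path of the checkpoint.
    Imports are local so the helpers above work without torch installed.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # 512 matches the sequence length stated in this card.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state

    masks = batch["attention_mask"].tolist()
    return [mean_pool(h, m) for h, m in zip(hidden.tolist(), masks)]
```

To rank documents against a query, embed both and sort by `cosine_similarity(query_vec, doc_vec)` in descending order.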

Direct Use

Downstream Use [optional]

[More Information Needed]

Out-of-Scope Use

Because the model is trained exclusively on the historical ENO dataset, it is not suited to modern Danish text.

Bias, Risks, and Limitations

[More Information Needed]

Recommendations

The model is based on historical texts that express a range of antiquated worldviews. These include racist, anti-democratic and patriarchal sentiments. This makes it utterly unfit for many use cases. It can, however, be used to examine such biases in Danish history.

How to Get Started with the Model

Use the code below to get started with the model.
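
A minimal sketch of masked-token prediction with the Hugging Face `transformers` fill-mask pipeline. The model ID in the usage comment is a placeholder, not a confirmed repository name, and the example sentence is only illustrative. The mask token is assumed to be `[MASK]`, the WordPiece default; check `tokenizer.mask_token` on the loaded tokenizer if predictions fail.

```python
def mask_word(text, word, mask_token="[MASK]"):
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, mask_token, 1)


def predict_masked(masked_text, model_id, top_k=5):
    """Return (token, score) pairs for the single masked position in `masked_text`.

    `model_id` is a placeholder for the hub ID or local path of the checkpoint.
    The import is local so `mask_word` works without transformers installed.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return [(r["token_str"], r["score"]) for r in fill(masked_text, top_k=top_k)]


# Usage (placeholder model ID; replace with the actual repository or path):
#   masked = mask_word("Skibet ankom til Kiøbenhavn", "Kiøbenhavn")
#   for token, score in predict_masked(masked, "<namespace>/DA-Bert_Old_News_V1"):
#       print(f"{token}\t{score:.3f}")
```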

[More Information Needed]

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing

Texts shorter than 35 characters were removed. Texts containing more than a predetermined amount of German, Latin, or grammatical errors were removed. Extra whitespace was also removed.
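
The length and whitespace steps can be sketched as below. This is an illustration under assumptions, not the authors' pipeline: it assumes whitespace is collapsed before the length check, and it leaves out the German/Latin/error filter because the card does not state its thresholds. The function names are hypothetical.

```python
import re

MIN_CHARS = 35  # texts shorter than this are dropped, per the card


def normalize_whitespace(text):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(texts, min_chars=MIN_CHARS):
    """Normalize whitespace, then keep only texts with at least `min_chars` characters."""
    cleaned = (normalize_whitespace(t) for t in texts)
    return [t for t in cleaned if len(t) >= min_chars]
```

Normalizing first means padding whitespace does not help a short text pass the length threshold.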

Training Hyperparameters

Training regime: fp16 mixed precision when a CUDA device is available
The model was trained for roughly 45 hours on the provided HPC system.
The MLM masking probability was set to 0.15.
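
To illustrate what an MLM probability of 0.15 means, the sketch below selects each token for prediction with probability 0.15 and then applies the 80/10/10 replacement rule from the original BERT recipe (mask, random token, unchanged). That split is also the default of `DataCollatorForLanguageModeling` in `transformers`; the card does not say which collator was used, so treat the split as an assumption. The function works on plain strings for readability, whereas real collators operate on token IDs.

```python
import random


def mask_tokens(tokens, vocab, mlm_prob=0.15, mask_token="[MASK]", rng=None):
    """Return (inputs, labels) for one MLM training example.

    labels[i] holds the original token where a prediction is required, else None.
    """
    rng = rng or random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_prob:
            continue  # not selected: no loss is computed at this position
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token         # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return inputs, labels
```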

Training arguments:
eval_strategy="steps"
overwrite_output_dir=True
num_train_epochs=15
per_device_train_batch_size=16
gradient_accumulation_steps=4
per_device_eval_batch_size=64
logging_steps=500
learning_rate=5e-5
save_steps=1000
save_total_limit=5
load_best_model_at_end=True
metric_for_best_model="eval_loss"
greater_is_better=False
fp16=torch.cuda.is_available()
warmup_steps=2000
warmup_ratio=0.03
weight_decay=0.01
lr_scheduler_type="cosine"
dataloader_num_workers=4
dataloader_pin_memory=True
save_on_each_node=False
ddp_find_unused_parameters=False
optim="adamw_torch"
local_rank=local_rank
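
Two consequences of these arguments are worth spelling out. First, the effective batch size per optimizer step is the per-device batch size times the accumulation steps times the number of devices. Second, both `warmup_steps` and `warmup_ratio` are set; in the Hugging Face `Trainer`, a positive `warmup_steps` takes precedence, so warmup presumably lasted 2000 steps. The sketch below mirrors the usual cosine-with-warmup schedule (half a cosine cycle, decaying to zero). The device count and total step count are not stated in this card, so both are left as parameters.

```python
import math

PER_DEVICE_TRAIN_BATCH_SIZE = 16
GRADIENT_ACCUMULATION_STEPS = 4
LEARNING_RATE = 5e-5
WARMUP_STEPS = 2000


def effective_batch_size(num_devices):
    """Sequences contributing to each optimizer step across all devices."""
    return PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * num_devices


def lr_at_step(step, total_steps, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Learning rate under linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
```

For example, `effective_batch_size(1)` gives 64 on a single device, and `lr_at_step(1000, total_steps)` is half the peak rate because warmup is linear.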

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

Cross-entropy loss, the standard metric for BERT MLM training.

Average loss on the test set.

Perplexity, calculated from the loss value.

Results

Loss: 2.08

Avg. Loss on test-set: 2.07

Perplexity: 7.65
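
Perplexity is conventionally the exponential of the mean cross-entropy loss (natural logarithm). The sketch below applies that convention to the figures above. Under it, a loss of 2.07 corresponds to a perplexity of about 7.92, while a perplexity of 7.65 corresponds to a loss of about 2.03, so the reported perplexity was presumably computed from an unrounded or differently averaged loss. The card does not specify which.

```python
import math


def perplexity(mean_loss):
    """Perplexity as exp of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_loss)


def loss_from_perplexity(ppl):
    """Inverse of `perplexity`: the mean loss implied by a perplexity value."""
    return math.log(ppl)


for loss in (2.08, 2.07):
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")
print(f"perplexity 7.65 -> loss {loss_from_perplexity(7.65):.3f}")
```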

Summary

Model Examination [optional]

[More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

UCloud, the cloud infrastructure available at Danish universities.

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Model Card Authors

Matias Appel
Johan Heinsen

Model Card Contact

CALDISS, AAU: www.caldiss.aau.dk