A newer version of this model is available: ChrisBridges/xlm-r-malach-v4

XLM-RoBERTa-Malach-v2

Version 2 of XLM-RoBERTa-large with continued pretraining on speech transcriptions from the Visual History Archive. The training data has two parts: Part 1 is ASR output, and Part 2 is machine-translated from Part 1 using MADLAD-400-3B-MT.

This version is possibly overfitting; use the newer version instead.
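
A minimal sketch of the machine-translation step used for Part 2, assuming the Hugging Face checkpoint google/madlad400-3b-mt and English-to-German as an example direction (the actual directions are those listed under Training Data):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/madlad400-3b-mt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# MADLAD-400 expects the target language as a "<2xx>" prefix token.
text = "<2de> The interview was recorded in 1996."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```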

Training Data

ASR data: cs, de, en, nl
MT data: cs, da, de, en, hu, nl, pl

Total tokens: 4.9B
Training tokens: 4.4B
Test tokens: 490M

Danish, Hungarian, and Polish ASR data are not yet available. The same documents are used in all 7 languages, but their proportions in the number of tokens may differ. A random 10% split is used as a test dataset, preserving the language proportions of the training data. The test set has been masked with a 15% token masking probability.
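
A hedged sketch of the per-language 10% test split and the fixed 15% masking, assuming one plain-text file per language; the file name, sequence length, and seed are placeholders, not taken from this card:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

# One plain-text file per language; "vha_cs.txt" is a placeholder name.
ds = load_dataset("text", data_files={"train": "vha_cs.txt"})["train"]

# 10% held out per language, so the test set keeps the language proportions.
split = ds.train_test_split(test_size=0.1, seed=42)

test_tok = split["test"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Mask the held-out split once with 15% probability so evaluation is deterministic.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
masked = collator([{"input_ids": ids} for ids in test_tok["input_ids"][:1000]])
```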

The data preprocessing (reading, tokenization, concatenation, splitting, and masking of the test dataset) takes around 2.5 hours per language using 8 CPUs.
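
A sketch of the tokenize-and-concatenate step under the same assumptions, run per language with 8 worker processes to match the setup above; block_size is an assumption, since the exact sequence length is not stated here:

```python
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
raw = load_dataset("text", data_files={"train": "vha_de.txt"})["train"]

block_size = 512  # assumption; the actual sequence length is not given in this card

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token ids, then cut them into fixed-length blocks.
    ids = list(chain.from_iterable(batch["input_ids"]))
    total = (len(ids) // block_size) * block_size
    return {"input_ids": [ids[i : i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, num_proc=8, remove_columns=["text"])
lm_dataset = tokenized.map(
    group_texts, batched=True, num_proc=8, remove_columns=tokenized.column_names
)
```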

Training Details

Parameters mostly replicate [1], Appendix B:
AdamW with eps=1e-6, beta1=0.9, beta2=0.98, weight decay 0.01, and learning rate 1e-4 with a linear schedule and linear warmup over the first 6% of training steps. The model is trained with dynamic masking on 8 A100s with a per-device batch size of 8 and 32 gradient accumulation steps, giving an effective batch size of 2048, for 1 epoch (33,622 steps) on an MLM objective.
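
A hedged mapping of these hyperparameters onto transformers.TrainingArguments; the output directory is an assumption, and lm_dataset refers to the preprocessed blocks from the sketch above:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

args = TrainingArguments(
    output_dir="xlm-r-malach-v2",      # placeholder
    per_device_train_batch_size=8,     # 8 GPUs x 8 per device x 32 accumulation = 2048
    gradient_accumulation_steps=32,
    num_train_epochs=1,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                 # linear warmup over the first 6% of steps
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,                 # Trainer uses AdamW by default
)

# Dynamic masking: the collator re-samples the 15% masks on every pass.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=lm_dataset,          # from the preprocessing sketch above
    data_collator=collator,
)
trainer.train()
```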

Main differences from XLM-RoBERTa-large pretraining:
AdamW instead of Adam, an effective batch size of 2048 instead of 8192, and 34k steps instead of 500k due to the smaller dataset. The learning rate is also smaller, since 4e-4 led to overfitting and increased the perplexity to 233.4169. This roughly aligns with [2] and [3], which also continue pretraining on small datasets.

The training takes around 9 hours.

Evaluation

Since the model sees translations of the evaluation samples during training, an additional domain-specific dataset has been prepared for unbiased evaluation. For this dataset, sentences were extracted from a NER dataset based on the EHRI Online Editions in 9 languages, not including Danish [4]. It is split into two evaluation datasets, EHRI-6 (714k tokens) and EHRI-9 (877k tokens), the latter including 3 languages unseen during training.

Perplexity (XLM-RoBERTa-large checkpoint -> this model):

42M token test set: 3.0177 -> 2.2252
490M token test set: 2.5179 -> 1.7540
EHRI-6: 3.2081 -> 3.2472
EHRI-9: 3.2696 -> 10.7375

Each pair shows the value for the original XLM-RoBERTa-large checkpoint followed by the value for this model. The larger (490M token) test set is split from the dataset used to train this model.
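
A sketch of how these perplexities can be computed: exponentiate the mean masked-LM loss over a pre-masked evaluation set. The checkpoint name and batching are illustrative, and averaging per-batch losses only approximates the exact token-level mean:

```python
import math

import torch
from transformers import AutoModelForMaskedLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained("ChrisBridges/xlm-r-malach-v2").to(device).eval()

def perplexity(batches):
    """batches: dicts with input_ids, attention_mask, and labels, where labels
    are -100 everywhere except at the masked positions (as produced by the
    masking collator in the sketch above)."""
    losses = []
    with torch.no_grad():
        for batch in batches:
            batch = {k: v.to(device) for k, v in batch.items()}
            losses.append(model(**batch).loss.item())
    return math.exp(sum(losses) / len(losses))
```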

References

[1] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[2] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
[3] The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings
[4] Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools
