# XLM-RoBERTa-Malach-v4
Version 4 of XLM-RoBERTa-large with continued pretraining on speech transcriptions from the Visual History Archive. Part 1 of the training data consists of ASR transcripts; Part 2 is machine-translated from Part 1 using MADLAD-400-3B-MT.
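As an illustration of the MT step, translating one ASR segment with MADLAD-400-3B-MT could look roughly like the sketch below. It assumes the public Hugging Face checkpoint `google/madlad400-3b-mt` and its `<2xx>` target-language prefix convention; the actual translation pipeline may differ in checkpoint, batching, and decoding settings.

```python
# Minimal sketch: translate one ASR segment with MADLAD-400-3B-MT.
# Assumes the public checkpoint "google/madlad400-3b-mt" and its "<2xx>"
# target-language prefix; the real pipeline may differ.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "google/madlad400-3b-mt"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

segment = "My family was deported in the spring of 1942."
inputs = tokenizer(f"<2pl> {segment}", return_tensors="pt")  # "<2pl>" = translate to Polish
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```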
## Training Data
- ASR data: cs, de, en, nl
- MT data: cs, da, de, en, hu, nl, pl
- Total tokens: 4.9B
- Training tokens: 4.4B
- Test tokens: 490M
Danish, Hungarian, and Polish ASR data are not yet available. The same documents are used in all 7 languages, but their token counts may differ. A random 10% split, preserving the language proportions of the training data, is used as a test dataset; its tokens are masked with 15% probability.
The data preprocessing (reading, tokenization, concatenation, splitting, and masking of the test dataset) takes around 2.5 hours per language using 8 CPUs.
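A rough sketch of this preprocessing, assuming Hugging Face `datasets` and `transformers`; the file path, block size, and masking helper are illustrative, not the exact scripts used.

```python
# Sketch: read raw text, tokenize, concatenate into fixed-length blocks,
# split off 10% as a test set, and statically mask the test set at 15%.
from itertools import chain

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
block_size = 512  # illustrative

raw = load_dataset("text", data_files={"train": "vha_cs.txt"})["train"]  # hypothetical path

def tokenize(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

def group(batch):
    # Concatenate all token ids and cut them into fixed-length blocks.
    ids = list(chain.from_iterable(batch["input_ids"]))
    total = (len(ids) // block_size) * block_size
    chunks = [ids[i:i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": chunks, "attention_mask": [[1] * block_size for _ in chunks]}

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"], num_proc=8)
blocks = tokenized.map(group, batched=True, num_proc=8)
split = blocks.train_test_split(test_size=0.1, seed=42)

# Static 15% masking of the test split so that evaluation is deterministic.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

def mask(batch):
    masked = collator([{"input_ids": ids} for ids in batch["input_ids"]])
    return {"input_ids": masked["input_ids"].tolist(), "labels": masked["labels"].tolist()}

test_masked = split["test"].map(mask, batched=True)
```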
## Training Details
Parameters are mostly replicated from [1] Appendix B:
AdamW with eps=1e-6, beta1=0.9, beta2=0.98, weight decay=0.01, and learning rate=2e-5 with a linear schedule and linear warmup over the first 6% of training steps.
Trained with dynamic masking on 4 L40s with a per-device batch size of 8 and 64 gradient accumulation steps (effective batch size 2048) for 1 epoch (34k steps) on an MLM objective.
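These hyperparameters map roughly onto the following `transformers` Trainer configuration. This is a sketch, not the exact training script; the training split is assumed to come from a preprocessing step like the one above.

```python
# Sketch of the training configuration with the hyperparameters listed above.
# Launch on 4 GPUs (e.g. via torchrun) so that 8 * 64 * 4 = 2048 effective batch size.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

# Dynamic masking: a fresh 15% mask is sampled each time a batch is collated.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlm-r-malach-v4",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=64,
    num_train_epochs=1,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    optim="adamw_torch",
)

train_dataset = split["train"]  # the preprocessed training split (see the sketch above)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=collator,
)
trainer.train()
```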
Main differences from XLM-RoBERTa-large:
- AdamW instead of Adam, an effective batch size of 2048 instead of 8192, and 34k steps instead of 500k due to the smaller dataset.
- A smaller learning rate, since larger ones led to overfitting. This roughly aligns with [2] and [3], which also continue pretraining on small datasets.
Training takes around 24 hours, but this can be reduced significantly with more GPUs.
## Evaluation
Since the model sees translations of the evaluation samples during training, an additional domain-specific dataset has been prepared for unbiased evaluation. For this dataset, sentences were extracted from a NER dataset based on the EHRI Online Editions, covering 9 languages but not Danish [4]. It is split into two evaluation sets, EHRI-6 (714k tokens) and EHRI-9 (877k tokens), the latter including three languages unseen during training (French, Slovak, and Yiddish).
Perplexity improvements from the XLM-RoBERTa-large checkpoint (XLM-RoBERTa-large -> XLM-R-Malach-v4):
- 42M token test set: 3.0177 -> 2.3273
- 490M token test set: 2.5179 -> 1.8564
- EHRI-6: 3.2081 -> 2.9780
- EHRI-9: 3.2696 -> 3.1681

The 490M test set is split from the dataset used to train this model and has a greater proportion of machine translations than the 42M test set.
Perplexity per language on the EHRI data (token counts per language in parentheses):
Model | cs (195k) | de (356k) | en (81k) | fr (3.5k) | hu (45k) | nl (2.5k) | pl (34k) | sk (6k) | yi (151k) |
---|---|---|---|---|---|---|---|---|---|
XLM-RoBERTa-large | 3.1553 | 3.4038 | 3.0588 | 2.0579 | 2.8928 | 2.9133 | 2.5284 | 2.6245 | 4.0217 |
XLM-R-Malach-v4 | 2.8156 | 3.1910 | 2.8895 | 2.0514 | 2.8745 | 2.9333 | 2.4120 | 2.6361 | 4.1083 |
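The perplexities above correspond to exp of the mean cross-entropy over masked positions. A minimal evaluation sketch on a statically masked test set (e.g. the `test_masked` dataset from the preprocessing sketch) could look as follows; batching and variable names are illustrative.

```python
# Sketch: perplexity of a masked LM on a pre-masked test set,
# computed as exp(mean cross-entropy over the masked positions).
import math

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("ChrisBridges/xlm-r-malach-v4").eval().cuda()

def collate(batch):
    return {
        "input_ids": torch.tensor([ex["input_ids"] for ex in batch]),
        "labels": torch.tensor([ex["labels"] for ex in batch]),
    }

loader = DataLoader(test_masked, batch_size=16, collate_fn=collate)  # test_masked: pre-masked test set

total_loss, total_masked = 0.0, 0
with torch.no_grad():
    for batch in loader:
        batch = {k: v.cuda() for k, v in batch.items()}
        out = model(**batch)
        n = (batch["labels"] != -100).sum().item()  # number of masked tokens in the batch
        total_loss += out.loss.item() * n           # out.loss is the mean over masked tokens
        total_masked += n

print("perplexity:", math.exp(total_loss / total_masked))
```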
## References
[1] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[2] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
[3] The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings
[4] Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools