XLM-RoBERTa-Malach-v2
Version 2 of XLM-RoBERTa-large with continued pretraining on speech transcriptions from the Visual History Archive. Part 1 of the training data consists of ASR transcriptions; Part 2 is machine-translated from Part 1 using MADLAD-400-3B-MT.
This model is possibly overfitting; use the newer version instead.
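The exact translation pipeline is not documented in this card; the following is a minimal sketch of how Part 2 could be produced, assuming the public google/madlad400-3b-mt checkpoint and its `<2xx>` target-language prefix convention.

```python
# Hedged sketch only: assumes the public google/madlad400-3b-mt checkpoint and
# its "<2xx>" target-language prefix; not the documented pipeline of this model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

mt_name = "google/madlad400-3b-mt"
mt_tokenizer = T5Tokenizer.from_pretrained(mt_name)
mt_model = T5ForConditionalGeneration.from_pretrained(mt_name)

def translate(sentence: str, target_lang: str = "de") -> str:
    # The target language is selected by prepending its "<2xx>" token.
    inputs = mt_tokenizer(f"<2{target_lang}> {sentence}", return_tensors="pt")
    output_ids = mt_model.generate(**inputs, max_new_tokens=256)
    return mt_tokenizer.decode(output_ids[0], skip_special_tokens=True)

# e.g. translate an English ASR segment into Polish
print(translate("The transcript segment to translate.", target_lang="pl"))
```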
Training Data
ASR data: cs, de, en, nl
MT data: cs, da, de, en, hu, nl, pl
Total tokens: 4.9B
Training tokens: 4.4B
Test tokens: 490M
Danish, Hungarian, and Polish ASR data are not yet available. The same documents are used in all seven languages, but their proportions in number of tokens may differ. A random 10% split is used as the test dataset, preserving the language proportions of the training data. Test-set tokens are masked with 15% probability.
The data preprocessing (reading, tokenization, concatenation, splitting, and masking of the test dataset) takes around 2.5 hours per language using 8 CPUs.
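The preprocessing script is not included in this card; below is a minimal sketch of the steps it names (reading, tokenization, concatenation into fixed-size blocks, the 10% split, and the fixed 15% test masking) using the Hugging Face datasets and transformers libraries. The file name, block size, and seed are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
block_size = 512  # assumed maximum sequence length

# Hypothetical input file; the actual transcript format is not documented here.
raw = load_dataset("text", data_files={"train": "vha_asr_cs.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

tokenized = raw.map(tokenize, batched=True, num_proc=8, remove_columns=["text"])

def group_texts(examples):
    # Concatenate all token sequences, then cut them into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total, block_size)]
        for k, t in concatenated.items()
    }

chunks = tokenized.map(group_texts, batched=True, num_proc=8)

# 10% random test split per language; splitting language by language
# preserves the language proportions of the training data.
splits = chunks.train_test_split(test_size=0.1, seed=42)

# Masking the held-out split once, with 15% probability, fixes the test masks
# so every checkpoint is evaluated on the same positions. During training the
# same collator is used for dynamic masking instead.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
```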
Training Details
Parameters are mostly replicated from [1] Appendix B:
AdamW with eps=1e-6, beta1=0.9, beta2=0.98, weight decay 0.01, and learning rate 1e-4 with a linear schedule and linear warmup over the first 6% of training steps.
Trained with dynamic masking on an MLM objective for 1 epoch (33,622 steps) on 8 A100s, with per-device batch size 8 and 32 gradient accumulation steps, giving an effective batch size of 2048.
Main differences from XLM-RoBERTa-large:
AdamW instead of Adam, effective batch size 2048 instead of 8192, and 34k steps instead of 500k due to a smaller dataset.
Smaller learning rate, since 4e-4 led to overfitting and increased the perplexity to 233.4169.
This is somewhat in line with [2] and [3], which continue pretraining on small datasets.
The training takes around 9 hours.
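A minimal sketch of this configuration with the Hugging Face Trainer; the actual training script is not part of this card, and `splits` and `collator` are assumed to come from a preprocessing step like the one sketched above.

```python
from transformers import AutoModelForMaskedLM, Trainer, TrainingArguments

model = AutoModelForMaskedLM.from_pretrained("FacebookAI/xlm-roberta-large")

args = TrainingArguments(
    output_dir="xlm-r-malach-v2",      # assumed output directory
    per_device_train_batch_size=8,     # 8 A100s x 8 x 32 accumulation = 2048
    gradient_accumulation_steps=32,
    num_train_epochs=1,                # 33,622 optimizer steps on this data
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                 # linear warmup over the first 6% of steps
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    optim="adamw_torch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],     # chunked MLM training data
    data_collator=collator,            # dynamic masking, assumed 15% as in RoBERTa
)
trainer.train()
```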
Evaluation
Since the model sees translations of evaluation samples during training, an additional domain-specific dataset was prepared for unbiased evaluation. Sentences were extracted from a NER dataset based on the EHRI Online Editions in 9 languages, not including Danish [4]. It is split into two evaluation sets, EHRI-6 (714k tokens) and EHRI-9 (877k tokens), the latter including three languages unseen during training.
Perplexity (42M token test set): 3.0177 -> 2.2252
Perplexity (490M token test set): 2.5179 -> 1.7540
Perplexity (EHRI-6): 3.2081 -> 3.2472
Perplexity (EHRI-9): 3.2696 -> 10.7375
The values show the change from the XLM-RoBERTa-large checkpoint to this model (left -> right). Note that the larger test set is split from the same data used to train this model.
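The evaluation script is not included here; as a reference, MLM perplexity on the pre-masked test sets can be computed as the exponential of the mean cross-entropy over masked tokens. A minimal sketch, assuming batches whose labels are -100 everywhere except at the masked positions (as produced by DataCollatorForLanguageModeling):

```python
import math
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("ChrisBridges/xlm-r-malach-v2").eval()

@torch.no_grad()
def mlm_perplexity(model, batches):
    """batches: iterable of dicts with input_ids, attention_mask, and labels tensors,
    where labels are -100 except at the (fixed) masked positions."""
    total_loss, total_masked = 0.0, 0
    for batch in batches:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        out = model(**batch)                        # loss = mean CE over masked tokens
        n = (batch["labels"] != -100).sum().item()  # number of masked tokens in batch
        total_loss += out.loss.item() * n
        total_masked += n
    return math.exp(total_loss / total_masked)
```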
References
[1] RoBERTa: A Robustly Optimized BERT Pretraining Approach
[2] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
[3] The ParlaSent Multilingual Training Dataset for Sentiment Identification in Parliamentary Proceedings
[4] Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools
Model: ChrisBridges/xlm-r-malach-v2
Base model: FacebookAI/xlm-roberta-large