Continued, off-premise pre-training of MedRoBERTa.nl on roughly 50 GB of open Dutch corpora and English corpora translated into Dutch.
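A minimal usage sketch, assuming the checkpoint is published with the standard RoBERTa fill-mask head under the repo id UMCU/CardioBERTa.nl_base listed in the model tree below:

```python
from transformers import pipeline

# Masked-token prediction with this model; the repo id comes from the model tree below.
fill_mask = pipeline("fill-mask", model="UMCU/CardioBERTa.nl_base")

# Dutch example sentence; <mask> is the RoBERTa-style mask token used by MedRoBERTa.nl.
for prediction in fill_mask("De patiënt werd opgenomen met pijn op de <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```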
Data statistics
Sources:
- Dutch: medical guidelines (FMS, NHG)
- Dutch: NtvG papers
- English: PubMed abstracts
- English: PMC abstracts translated using DeepL
- English: Apollo guidelines, papers and books
- English: Meditron guidelines
- English: MIMIC-III
- English: MIMIC-CXR
- English: MIMIC-IV
All English sources not already translated with DeepL were translated into Dutch using a combination of Gemini 1.5 Flash, GPT-4o mini, MarianNMT, and NLLB-200 (see the sketch at the end of this section).
- Number of tokens: 15B
- Number of documents: 27M
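The translation pipeline itself is not released with this card. Below is a minimal sketch of how the open-source NLLB-200 part of such a pipeline could look, assuming the facebook/nllb-200-distilled-600M checkpoint and the English/Dutch language codes eng_Latn and nld_Latn; the DeepL, Gemini 1.5 Flash and GPT-4o mini portions go through their respective APIs and are not shown:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint choice; this card only names the NLLB-200 family.
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate_en_to_nl(texts, max_length=512):
    """Translate a batch of English passages into Dutch with NLLB-200."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=max_length)
    generated = model.generate(
        **batch,
        # NLLB selects the target language via a forced BOS token (nld_Latn = Dutch).
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("nld_Latn"),
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate_en_to_nl(["The patient was admitted with chest pain."]))
```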
Training
- Effective batch size: 5120
- Learning rate: 2e-4
- Weight decay: 1e-3
- Learning rate schedule: linear, with 5,000 warmup steps
- Num epochs: ~3
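The training code is not published with this card; the following is a hedged sketch of how the hyperparameters above map onto a Hugging Face Trainer set-up for continued masked-language-model pre-training. The toy corpus, the per-device batch size / gradient-accumulation split, and the 15% masking ratio are illustrative assumptions; only the values listed above come from this card.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Continue from the base model named in this card.
tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Toy stand-in for the ~27M-document corpus described above.
corpus = Dataset.from_dict({"text": [
    "De patiënt heeft pijn op de borst.",
    "Het ECG toont een sinusritme.",
]})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard 15% dynamic masking (assumed; the masking ratio is not stated on this card).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cardioberta_nl_base",
    per_device_train_batch_size=64,   # illustrative split: 64 x 80 accumulation = 5120 effective
    gradient_accumulation_steps=80,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=tokenized,
)
trainer.train()
```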
- Train perplexity: 3.0
- Validation perplexity: 3.0
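These perplexities correspond to exp(mean cross-entropy loss) over the masked tokens. A minimal sketch of reading such a number off an evaluation run, reusing the hypothetical `trainer` and `tokenized` objects from the sketch above:

```python
import math

# Perplexity is the exponential of the mean masked-token cross-entropy loss.
metrics = trainer.evaluate(eval_dataset=tokenized)
print("perplexity:", math.exp(metrics["eval_loss"]))
```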
Acknowledgement
This work was done together with the Amsterdam UMC, in the context of the DataTools4Heart project.
We are grateful to have been able to use the Google TPU Research Cloud for training the model.
Model tree for UMCU/CardioBERTa.nl_base
- Base model: CLTL/MedRoBERTa.nl