Continued, off-premises pre-training of MedRoBERTa.nl using approximately 50 GB of open Dutch and translated English corpora.
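
Since this is continued pre-training of MedRoBERTa.nl, the model can be loaded with the standard Hugging Face transformers API. Below is a minimal fill-mask sketch; the example sentence is made up for illustration, and for downstream tasks you would load the model with a task-specific head instead.

```python
# Minimal sketch: masked-token prediction with the continued-pretrained model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="UMCU/CardioBERTa.nl_base")

# Hypothetical Dutch clinical sentence; <mask> is the RoBERTa mask token.
for pred in fill_mask("De patiënt werd opgenomen met <mask> op de borst."):
    print(pred["token_str"], pred["score"])
```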

Data statistics

Sources:

  • Dutch: medical guidelines (FMS, NHG)
  • Dutch: NtvG papers
  • English: PubMed abstracts
  • English: PMC abstracts translated using DeepL
  • English: Apollo guidelines, papers and books
  • English: Meditron guidelines
  • English: MIMIC-III
  • English: MIMIC-CXR
  • English: MIMIC-IV

All English sources that were not translated with DeepL were translated to Dutch using a combination of Gemini Flash 1.5, GPT-4o mini, MarianNMT, and NLLB-200.
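
For the open-weight part of that translation pipeline, the sketch below shows how English-to-Dutch translation with NLLB-200 can be run through transformers. The specific checkpoint (facebook/nllb-200-distilled-600M), the example sentence, and the generation settings are assumptions for illustration, not a description of the exact setup used.

```python
# Sketch: English-to-Dutch machine translation with NLLB-200 (assumed checkpoint).
# The actual corpus translation also used DeepL, Gemini Flash 1.5, GPT-4o mini and MarianNMT.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # assumption: distilled NLLB-200 checkpoint
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

abstracts = [
    "The patient presented with acute chest pain and elevated troponin levels.",
]
for out in translator(abstracts, max_length=400):
    print(out["translation_text"])
```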

  • Number of tokens: 15B
  • Number of documents: 27M

Training

  • Effective batch size: 5120
  • Learning rate: 2e-4
  • Weight decay: 1e-3
  • Learning rate schedule: linear, with 5,000 warmup steps
  • Number of epochs: ~3

  • Train perplexity: 3.0
  • Validation perplexity: 3.0
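
For reference, here is a minimal sketch of how continued masked-language-model pre-training with the hyperparameters listed above could be set up with the transformers Trainer. The per-device batch size and gradient-accumulation split of the effective batch size (5,120), the masking probability, and the placeholder corpus are assumptions for illustration, not the actual training script.

```python
# Sketch: continued MLM pre-training of MedRoBERTa.nl with the hyperparameters above.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("CLTL/MedRoBERTa.nl")
model = AutoModelForMaskedLM.from_pretrained("CLTL/MedRoBERTa.nl")

# Placeholder corpus; the actual run used ~27M Dutch and translated documents.
texts = ["De patiënt werd opgenomen met pijn op de borst."]
tokenized_dataset = [tokenizer(t, truncation=True, max_length=512) for t in texts]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cardioberta_nl_base",
    per_device_train_batch_size=128,   # assumption: 128 x 40 accumulation steps = 5120
    gradient_accumulation_steps=40,
    learning_rate=2e-4,
    weight_decay=1e-3,
    lr_scheduler_type="linear",
    warmup_steps=5_000,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=collator,
)
trainer.train()
```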

Acknowledgement

This work was carried out in collaboration with Amsterdam UMC, in the context of the DataTools4Heart project.

We gratefully acknowledge the use of the Google TPU Research Cloud for training the model.

Model details

  • Model size: 166M parameters
  • Tensor type: F32 (safetensors)
  • Base model: CLTL/MedRoBERTa.nl