Continued on-premises pre-training of MedRoBERTa.nl on de-identified Electronic Health Records (EHRs) from the University Medical Center Utrecht (UMCU), focused on the cardiology domain.

Data statistics

Sources:

  • Dutch medical guidelines (FMS, NHG)
  • NtvG papers
  • PMC abstracts translated using Gemini 1.5 Flash

Corpus statistics:

  • Number of tokens: 1.47B, of which 1B from UMCU EHRs
  • Number of documents: 5.8M, of which 3.5M are UMCU EHRs
  • Average number of tokens per document: 253
  • Median number of tokens per document: 124
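As a sanity check, the reported average follows directly from the corpus totals above:

```python
# Corpus totals taken from the data statistics above.
total_tokens = 1.47e9   # 1.47B tokens in the continued pre-training corpus
num_documents = 5.8e6   # 5.8M documents

avg_tokens_per_doc = total_tokens / num_documents
print(round(avg_tokens_per_doc))  # 253, matching the reported average
```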

Training

  • Effective batch size: 240
  • Learning rate: 1e-4
  • Weight decay: 1e-3
  • Learning rate schedule: linear, with 25,000 warmup steps
  • Num epochs: 3
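The linear schedule with warmup can be sketched as follows. This is a hypothetical helper, not the training code; `total_steps` is illustrative, since the real value depends on corpus size, the effective batch size (240), and the number of epochs (3):

```python
def linear_schedule_with_warmup(step, peak_lr=1e-4, warmup_steps=25_000,
                                total_steps=100_000):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# The learning rate peaks exactly at the end of warmup:
print(linear_schedule_with_warmup(25_000))  # 0.0001
```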

Train perplexity: 3.0

Validation perplexity: 4.0
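For a masked language model, perplexity is the exponential of the mean cross-entropy loss over masked tokens, so a validation perplexity of 4.0 corresponds to a mean loss of ln(4.0) ≈ 1.39 nats per token. A minimal sketch:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per masked token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model with a constant per-token loss of ln(4.0) has perplexity 4.0:
print(perplexity([math.log(4.0)] * 10))  # ~4.0 (up to float rounding)
```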

Acknowledgements

This work was carried out together with Amsterdam UMC, in the context of the DataTools4Heart project.

Model size: 126M parameters (F32, safetensors)

Base model: CLTL/MedRoBERTa.nl