EriBERTa
A Bilingual Pre-Trained Language Model
for Clinical Natural Language Processing

We introduce EriBERTa, a bilingual (English and Spanish) domain-specific language model pre-trained on extensive medical and clinical corpora. EriBERTa outperforms previous Spanish language models in the clinical domain, both in understanding medical texts and in extracting information from them. It also shows promising transfer learning abilities: knowledge learned in one language carries over to the other, which is particularly beneficial given the scarcity of Spanish clinical data.

📖 Paper: EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing

How to Get Started with the Model

You can load the model using:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("HiTZ/EriBERTa-base")
model = AutoModelForMaskedLM.from_pretrained("HiTZ/EriBERTa-base")
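
Once loaded, the model can be queried for masked-token predictions. The following is a minimal sketch, not an official usage recipe: it relies on the `fill-mask` pipeline from `transformers`, and the helper names `insert_mask` and `fill_in_mask`, as well as the `{mask}` placeholder convention, are assumptions made for this illustration. The mask token is read from the tokenizer rather than hard-coded.

```python
def insert_mask(template, mask_token):
    """Replace the single "{mask}" placeholder with the tokenizer's mask token.

    The "{mask}" placeholder is a convention of this sketch, not of the model.
    """
    if template.count("{mask}") != 1:
        raise ValueError('template must contain exactly one "{mask}" placeholder')
    return template.replace("{mask}", mask_token)


def fill_in_mask(template, model_name="HiTZ/EriBERTa-base", top_k=5):
    """Return the top_k (token, score) predictions for the masked position."""
    # Imported lazily so that defining the helpers does not require transformers.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    # Ask the tokenizer for its mask token instead of assuming a literal string.
    text = insert_mask(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]
```

For example, `fill_in_mask("El paciente presenta {mask} abdominal.")` downloads the checkpoint on first use and returns a list of candidate tokens with their scores.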

Model Description

Developed by: Iker De la Iglesia, Aitziber Atutxa, Koldo Gojenola, and Ander Barrena
Contact: Iker De la Iglesia and Ander Barrena
Language(s) (NLP): English, Spanish
License: apache-2.0
Funding:
- The Spanish Ministry of Science and Innovation, MCIN/AEI/ 10.13039/501100011033/FEDER projects:
  - Proyectos de Generación de Conocimiento 2022 (EDHIA PID2022-136522OB-C22)
  - DOTT-HEALTH/PAT-MED PID2019-543106942RB-C31.
  - EU NextGeneration EU/PRTR (DeepR3 TED2021-130295B-C31, ANTIDOTE PCI2020-120717-2 EU ERA-Net CHIST-ERA).
- Basque Government:
  - IXA IT1570-22.

Model Details

Pre-Training settings for EriBERTa-base.
Parameters	~125M
Vocabulary size	64k
Sequence length	512
Tokens per step	2M
Steps	125k
Total tokens	4.5B
Scheduler	Linear with Warm-up
Peak LR	2.683e-4
Warm-up Steps	7.5k
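
To make the scheduler settings concrete, the sketch below computes the learning rate at a given step from the values in the table (peak LR 2.683e-4, 7.5k warm-up steps, 125k total steps). It assumes the common shape for a linear schedule with warm-up: a linear ramp from zero to the peak, followed by a linear decay to zero at the final step. The table does not state the decay endpoint, so that part is an assumption, and `linear_warmup_lr` is a hypothetical helper name.

```python
PEAK_LR = 2.683e-4      # from the table
WARMUP_STEPS = 7_500    # from the table
TOTAL_STEPS = 125_000   # from the table


def linear_warmup_lr(step, peak_lr=PEAK_LR,
                     warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Learning rate at `step` for a linear warm-up followed by linear decay.

    Assumption: the rate decays to zero at `total_steps`.
    """
    if step < warmup_steps:
        # Warm-up phase: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from the peak to 0 at the last step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate is zero at step 0, reaches the peak at step 7,500, and returns to zero at step 125,000.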

Training Data

Data sources and word counts by language.
Language	Source	Words
English	ClinicalTrials	127.4M
	EMEA	12M
	PubMed	968.4M
	MIMIC-III	206M
Spanish	EMEA	13.6M
	PubMed	8.4M
	Medical Crawler	918M
	SPACC	350K
	UFAL	10.5M
	WikiMed	5.2M
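
The per-language balance of the corpus can be checked directly from the table. The sketch below sums the word counts (in millions) and computes each language's share; the dictionary simply transcribes the table, and the function names are illustrative only.

```python
# Word counts in millions, transcribed from the table above (350K = 0.35M).
WORDS_MILLIONS = {
    "English": {
        "ClinicalTrials": 127.4,
        "EMEA": 12.0,
        "PubMed": 968.4,
        "MIMIC-III": 206.0,
    },
    "Spanish": {
        "EMEA": 13.6,
        "PubMed": 8.4,
        "Medical Crawler": 918.0,
        "SPACC": 0.35,
        "UFAL": 10.5,
        "WikiMed": 5.2,
    },
}


def language_totals(table):
    """Total words (in millions) per language."""
    return {lang: sum(sources.values()) for lang, sources in table.items()}


def language_shares(table):
    """Fraction of the whole corpus contributed by each language."""
    totals = language_totals(table)
    grand_total = sum(totals.values())
    return {lang: total / grand_total for lang, total in totals.items()}
```

With the figures above, English comes to about 1,313.8M words and Spanish to about 956.05M, roughly a 58% to 42% split. Around 96% of the Spanish words come from the Medical Crawler source.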

Limitations and Bias

EriBERTa is a masked language model and, as released, performs only the Fill Mask task. Its suitability for fine-tuning on downstream tasks such as Named Entity Recognition (NER) and Text Classification has been evaluated, but we recommend validating and testing the model on each specific application before deploying it in production.
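
As a starting point for the fine-tuning mentioned above, the sketch below attaches a token-classification head to the checkpoint for an NER task. It is an illustration under assumptions, not the authors' training setup: the label set in the usage note is hypothetical, and `build_label_maps` and `load_for_token_classification` are helper names invented here. The new head is randomly initialized, so the model must still be trained and validated on task data.

```python
def build_label_maps(labels):
    """Build the id-to-label and label-to-id mappings expected by the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_token_classification(labels, model_name="HiTZ/EriBERTa-base"):
    """Load the tokenizer and the encoder with an untrained NER head."""
    # Imported lazily so that defining the helpers does not require transformers.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

For instance, `load_for_token_classification(["O", "B-DISEASE", "I-DISEASE"])` returns a tokenizer and a model ready to be passed to a training loop, where the three labels stand in for whatever tag set the target dataset defines.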

Due to the scarcity of medical-clinical corpora, the EriBERTa model has been trained on a corpus gathered from multiple sources, including web crawling. Thus, the employed corpora may not encompass all possible linguistic and contextual variations present in clinical language. Consequently, the model may exhibit limitations when applied to specific clinical subdomains or rare medical conditions not well-represented in the training data.

Biases

Data Collection Bias: The training data for EriBERTa was collected from various sources, some of them using web crawling techniques. This method may introduce biases related to the prevalence of certain types of content, perspectives, and language usage patterns. Consequently, the model might reflect and propagate these biases in its predictions.
Demographic and Linguistic Bias: Given that the web-sourced corpus may not equally represent all demographic groups or linguistic nuances, the model may perform disproportionately well for certain populations while underperforming for others. This could lead to disparities in the quality of clinical data processing and information retrieval across different patient groups.
Unexamined Ethical Considerations: As of now, no comprehensive measures have been taken to systematically evaluate the ethical implications and biases embedded in EriBERTa. While we are committed to addressing these issues, the current version of the model may inadvertently perpetuate existing biases and ethical concerns inherent in the data.

Disclaimer

EriBERTa has not been designed or developed to be used as a medical device. Any output should be verified by a healthcare professional, and no direct diagnosis should be claimed. Like any language model, it may produce predictions that are unreliable, incorrect, or biased.

We accept no liability for the use of this model. It should ideally be fine-tuned and tested before application, and it must not be used as a medical tool or for any critical decision-making process without thorough validation and supervision by qualified professionals.

Citing information

@misc{delaiglesia2023eriberta,
      title={{EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing}}, 
      author={Iker De la Iglesia and Aitziber Atutxa and Koldo Gojenola and Ander Barrena},
      year={2023},
      eprint={2306.07373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
