bioner_medmentions_st21pv

This is a named entity recognition model fine-tuned from the microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext model. It predicts spans with 14 possible labels. The labels are Anatomy, Chemicals & Drugs, Concepts & Ideas, Devices, Disorders, Genes & Molecular Sequences, Geographic Areas, Living Beings, Objects, Occupations, Organizations, Phenomena, Physiology and Procedures.

The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.

Example Usage

The code below loads the model and applies it to the provided text. It uses the "max" aggregation strategy to merge the individual token predictions into larger multi-token entities where needed.

from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification", 
                        model="Glasgow-AI4BioMed/bioner_medmentions_st21pv",
                        aggregation_strategy="max")

# Apply it to some text
ner_pipeline("EGFR T790M mutations have been known to affect treatment outcomes for NSCLC patients receiving erlotinib.")

# Output (start and end are character offsets into the input text):
# [ {"entity_group": "Disorders", "score": 0.62466, "word": "egfr t790m mutations", "start": 0, "end": 20},
#   {"entity_group": "Disorders", "score": 0.98835, "word": "nsclc", "start": 70, "end": 75},
#   {"entity_group": "Chemicals & Drugs", "score": 0.97885, "word": "erlotinib", "start": 95, "end": 104} ]
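Because the base model is uncased, the "word" field comes back lowercased. A minimal sketch of one way to post-process the predictions is shown here: it slices the input string with the start and end offsets to keep the original casing, and groups the mentions by label. The helper name group_entities is hypothetical and not part of the model repo; the predictions are written out by hand in the shape the pipeline returns, with offsets computed against the sentence as written, so the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_entities(text, entities):
    """Group predictions by label, using offsets to recover the original surface text."""
    grouped = defaultdict(list)
    for ent in entities:
        # Slice the input string rather than trusting the lowercased "word" field.
        surface = text[ent["start"]:ent["end"]]
        grouped[ent["entity_group"]].append(surface)
    return dict(grouped)


text = ("EGFR T790M mutations have been known to affect treatment outcomes "
        "for NSCLC patients receiving erlotinib.")

# Hand-written predictions in the pipeline's output shape (not produced by running the model here).
entities = [
    {"entity_group": "Disorders", "score": 0.62466, "word": "egfr t790m mutations", "start": 0, "end": 20},
    {"entity_group": "Disorders", "score": 0.98835, "word": "nsclc", "start": 70, "end": 75},
    {"entity_group": "Chemicals & Drugs", "score": 0.97885, "word": "erlotinib", "start": 95, "end": 104},
]

print(group_entities(text, entities))
```

In practice the entities list would be the return value of ner_pipeline(text).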

Dataset Info

Source: The ST21pv version of MedMentions was downloaded from: https://github.com/chanzuckerberg/MedMentions/tree/master/st21pv

The dataset should be cited with: Mohan, Sunil, and Donghui Li. "MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts." Automated Knowledge Base Construction (AKBC), 2019, https://openreview.net/forum?id=SylxCx5pTQ. DOI: 10.24432/C5G59C

An overview of semantic types can be found at: https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html

Preprocessing: The training, validation and test splits were maintained from the original dataset. Concept identifiers (CUIs) were used to map each annotation to its associated UMLS entry to recover semantic types (from the MRSTY.RRF UMLS file). Semantic types provided in MedMentions were not used. Annotations were then mapped to semantic group names using the Semantic Groups file available at: https://www.nlm.nih.gov/research/umls/knowledge_sources/semantic_network/index.html. This contrasts with the fine-grained version, which maps annotations to semantic types. The preprocessing script for this dataset is prepare_medmentions.py, run without the --finegrain flag.
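The CUI to semantic type to semantic group lookup described above can be sketched roughly as follows. This is an illustration only, not the code in prepare_medmentions.py. It assumes the pipe-delimited layouts of MRSTY.RRF (CUI|TUI|STN|STY|ATUI|CVF|) and the Semantic Groups file (abbreviation|group name|TUI|type name). The function names are hypothetical, and the sample rows use placeholder CUIs and ATUIs rather than real UMLS records. A CUI can carry several semantic types, so the sketch returns every matching group; how the actual script resolves such cases is not stated here.

```python
def load_cui_to_types(mrsty_lines):
    """Map each CUI to the set of semantic type identifiers (TUIs) listed for it."""
    cui_to_tuis = {}
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        cui, tui = fields[0], fields[1]
        cui_to_tuis.setdefault(cui, set()).add(tui)
    return cui_to_tuis


def load_type_to_group(semgroups_lines):
    """Map each TUI to its semantic group name."""
    tui_to_group = {}
    for line in semgroups_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _abbrev, group_name, tui, _type_name = fields[:4]
        tui_to_group[tui] = group_name
    return tui_to_group


def groups_for_cui(cui, cui_to_tuis, tui_to_group):
    """Return the sorted semantic group names reachable from a CUI (empty if unknown)."""
    tuis = cui_to_tuis.get(cui, ())
    return sorted({tui_to_group[t] for t in tuis if t in tui_to_group})


# Placeholder rows in the two file layouts (illustrative, not real UMLS records).
mrsty_sample = [
    "C0000001|T047|B2.2.1.2.1|Disease or Syndrome|AT00000001||",
    "C0000002|T121|A1.4.1.1.1|Pharmacologic Substance|AT00000002||",
    "C0000002|T109|A1.4.1.2.1|Organic Chemical|AT00000003||",
]
semgroups_sample = [
    "DISO|Disorders|T047|Disease or Syndrome",
    "CHEM|Chemicals & Drugs|T121|Pharmacologic Substance",
    "CHEM|Chemicals & Drugs|T109|Organic Chemical",
]

cui_to_tuis = load_cui_to_types(mrsty_sample)
tui_to_group = load_type_to_group(semgroups_sample)
print(groups_for_cui("C0000002", cui_to_tuis, tui_to_group))
```

With the real files, the two loaders would be fed open file handles for MRSTY.RRF and the Semantic Groups file instead of the sample lists.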

Performance

The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo in Markdown format for viewing and JSON format for easier loading. These include token-level performance (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).

Label	Precision	Recall	F1-score	Support
Anatomy	0.656	0.672	0.664	3277
Chemicals & Drugs	0.748	0.745	0.747	7398
Concepts & Ideas	0.515	0.370	0.430	3683
Devices	0.447	0.372	0.406	355
Disorders	0.691	0.641	0.665	8109
Genes & Molecular Sequences	0.506	0.567	0.535	1115
Geographic Areas	0.671	0.737	0.703	598
Living Beings	0.718	0.739	0.728	3994
Objects	0.518	0.598	0.555	336
Occupations	0.367	0.480	0.416	196
Organizations	0.504	0.634	0.561	382
Phenomena	0.206	0.271	0.234	269
Physiology	0.560	0.582	0.571	3833
Procedures	0.597	0.607	0.602	6599
macro avg	0.550	0.573	0.558	40144
weighted avg	0.641	0.630	0.634	40144
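To make the two summary rows concrete, the short sketch below recomputes the macro and weighted F1 from the per-label F1 and support columns of the table. The macro average weights every label equally, while the weighted average weights each label by its support. The inputs are the rounded table values, so this is a consistency check rather than the evaluation code used for the model.

```python
# (label, F1-score, support) copied from the table above
rows = [
    ("Anatomy", 0.664, 3277),
    ("Chemicals & Drugs", 0.747, 7398),
    ("Concepts & Ideas", 0.430, 3683),
    ("Devices", 0.406, 355),
    ("Disorders", 0.665, 8109),
    ("Genes & Molecular Sequences", 0.535, 1115),
    ("Geographic Areas", 0.703, 598),
    ("Living Beings", 0.728, 3994),
    ("Objects", 0.555, 336),
    ("Occupations", 0.416, 196),
    ("Organizations", 0.561, 382),
    ("Phenomena", 0.234, 269),
    ("Physiology", 0.571, 3833),
    ("Procedures", 0.602, 6599),
]

# Macro average: unweighted mean over the 14 labels.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: each label's F1 weighted by its number of gold spans.
total_support = sum(support for _, _, support in rows)
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(f"total support = {total_support}")
print(f"macro F1      = {macro_f1:.3f}")
print(f"weighted F1   = {weighted_f1:.3f}")
```

The gap between the two averages reflects that low-support labels such as Phenomena and Occupations pull the macro average down more than the weighted one.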

Hyperparameters

Hyperparameter tuning was done with Optuna and the hyperparameter_search functionality, running 100 trials. Early stopping was applied during training. The best-performing model was selected using macro F1 on the validation set. The selected hyperparameters are in the table below.

Hyperparameter	Value
epochs	4.0
learning_rate	9.767344966191627e-05
per_device_train_batch_size	16
weight_decay	0.025286446963170207
warmup_ratio	0.021367464793327073
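For anyone wanting to reuse these values, a minimal sketch of turning the table into keyword arguments is given below. It assumes the "epochs" entry corresponds to the num_train_epochs argument of the transformers TrainingArguments class; that mapping, the helper name to_training_kwargs and the output_dir value in the comment are assumptions for illustration, and the actual training code lives in the bioner repository linked above.

```python
# Selected hyperparameters, copied from the table above
selected = {
    "epochs": 4.0,
    "learning_rate": 9.767344966191627e-05,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.025286446963170207,
    "warmup_ratio": 0.021367464793327073,
}


def to_training_kwargs(hparams):
    """Copy the dict and rename 'epochs' to 'num_train_epochs' (assumed mapping)."""
    kwargs = dict(hparams)
    kwargs["num_train_epochs"] = kwargs.pop("epochs")
    return kwargs


training_kwargs = to_training_kwargs(selected)
print(training_kwargs)

# With transformers installed, the values could then be passed on, for example:
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="bioner_out", **training_kwargs)  # output_dir is a placeholder
```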
