bioner_gnormplus

This is a named entity recognition (NER) model fine-tuned from microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext. It predicts spans with one of three labels: DomainMotif, FamilyName and Gene.

The code used for training this model can be found at https://github.com/Glasgow-AI4BioMed/bioner along with links to other biomedical NER models trained on well-known biomedical corpora. The source dataset information is below.

Example Usage

The code below will load up the model and apply it to the provided text. It uses a simple aggregation strategy to post-process the individual tokens into larger multi-token entities where needed.

from transformers import pipeline

# Load the model as part of an NER pipeline
ner_pipeline = pipeline("token-classification", 
                        model="Glasgow-AI4BioMed/bioner_gnormplus",
                        aggregation_strategy="max")

# Apply it to some text
ner_pipeline("ZNF598 is a Zinc finger containing E3 ubiquitin ligase.")

# Output:
# [ {"entity_group": "Gene", "score": 0.99889, "word": "znf598", "start": 0, "end": 6},
#   {"entity_group": "DomainMotif", "score": 0.74961, "word": "zinc finger", "start": 12, "end": 23},
#   {"entity_group": "FamilyName", "score": 0.89084, "word": "e3 ubiquitin ligase", "start": 35, "end": 54} ]
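
The `word` values in the output are lowercased, presumably because the base model is uncased. The `start` and `end` character offsets can be used to recover the original-case text. The following is a minimal sketch of that step; the helper name `attach_surface_text` is hypothetical, and the entity list is copied from the example output above rather than produced by running the model.

```python
def attach_surface_text(text, entities):
    """Return copies of the entity dicts with an added "text" field
    holding the original-case substring text[start:end]."""
    return [{**ent, "text": text[ent["start"]:ent["end"]]} for ent in entities]


text = "ZNF598 is a Zinc finger containing E3 ubiquitin ligase."

# Entities copied from the example output above (not re-computed here).
entities = [
    {"entity_group": "Gene", "score": 0.99889, "word": "znf598", "start": 0, "end": 6},
    {"entity_group": "DomainMotif", "score": 0.74961, "word": "zinc finger", "start": 12, "end": 23},
    {"entity_group": "FamilyName", "score": 0.89084, "word": "e3 ubiquitin ligase", "start": 35, "end": 54},
]

for ent in attach_surface_text(text, entities):
    print(ent["entity_group"], ent["text"])
```

The same helper should work on the list returned directly by `ner_pipeline(text)`, since it relies only on the `start` and `end` keys.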

Dataset Info

Source: The GNormPlus dataset was downloaded from: https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/

The dataset should be cited with: Wei, Chih-Hsuan, Hung-Yu Kao, and Zhiyong Lu. "GNormPlus: an integrative approach for tagging genes, gene families, and protein domains." BioMed research international 2015.1 (2015): 918710. DOI: 10.1155/2015/918710

Preprocessing: The training set was split 75/25 to create a training and validation set. No changes were made to the annotations. The preprocessing script for this dataset is prepare_gnormplus.py.
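
For illustration only, a 75/25 split of this kind could look like the sketch below. It assumes a random split at document level with a fixed seed; the function name, the seed value and the placeholder document IDs are all hypothetical, and the actual logic in prepare_gnormplus.py may differ.

```python
import random


def split_train_validation(documents, validation_fraction=0.25, seed=42):
    """Shuffle a copy of `documents` and split it into (train, validation)."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder IDs, not real GNormPlus documents.
doc_ids = [f"doc_{i}" for i in range(8)]
train, validation = split_train_validation(doc_ids)
print(len(train), len(validation))
```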

Performance

The span-level performance on the test split for each label is shown in the table below. The full performance results are available in the model repo, in Markdown format for viewing and JSON format for easier loading. These include token-level performance (with individual B- and I- labels, as the token classifier uses IOB2 token labelling).

| Label | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| DomainMotif | 0.602 | 0.670 | 0.634 | 361 |
| FamilyName | 0.497 | 0.569 | 0.530 | 1250 |
| Gene | 0.856 | 0.923 | 0.888 | 3225 |
| macro avg | 0.651 | 0.721 | 0.684 | 4836 |
| weighted avg | 0.744 | 0.812 | 0.777 | 4836 |
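
The two average rows can be reproduced from the per-label rows. The sketch below assumes the usual conventions for classification reports: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over labels, and the weighted average weights each label by its support. The function names are illustrative, and small differences in the last digit are expected because the inputs are already rounded.

```python
# Per-label rows copied from the table above: (precision, recall, F1, support).
rows = {
    "DomainMotif": (0.602, 0.670, 0.634, 361),
    "FamilyName": (0.497, 0.569, 0.530, 1250),
    "Gene": (0.856, 0.923, 0.888, 3225),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(rows, column):
    """Unweighted mean of one metric column over all labels."""
    return sum(r[column] for r in rows.values()) / len(rows)


def weighted_avg(rows, column):
    """Mean of one metric column, weighted by each label's support."""
    total_support = sum(r[3] for r in rows.values())
    return sum(r[column] * r[3] for r in rows.values()) / total_support


for label, (p, r, reported_f1, support) in rows.items():
    print(f"{label}: F1 from P/R = {f1(p, r):.3f} (reported {reported_f1})")
print(f"macro F1 = {macro_avg(rows, 2):.3f}")
print(f"weighted F1 = {weighted_avg(rows, 2):.3f}")
```

The gap between the macro and weighted F1 reflects the label imbalance: Gene has the highest score and by far the largest support, so it dominates the weighted average.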

Hyperparameters

Hyperparameter tuning was done with optuna and the hyperparameter_search functionality. 100 trials were run. Early stopping was applied during training. The best performing model was selected using the macro F1 performance on the validation set. The selected hyperparameters are in the table below.

| Hyperparameter | Value |
|---|---|
| epochs | 13.0 |
| learning_rate | 4.312024782506724e-05 |
| per_device_train_batch_size | 16 |
| weight_decay | 0.004637010897989902 |
| warmup_ratio | 0.00046632724074153857 |
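
To reuse these values in a new training run, they can be collected into keyword arguments for the transformers `TrainingArguments` class. This is only a sketch: it assumes that `epochs` corresponds to the `num_train_epochs` argument, the helper name `to_training_kwargs` and the output directory in the final comment are hypothetical, and the training script in the bioner repository may set additional options.

```python
# Selected hyperparameters, copied from the table above.
selected = {
    "epochs": 13.0,
    "learning_rate": 4.312024782506724e-05,
    "per_device_train_batch_size": 16,
    "weight_decay": 0.004637010897989902,
    "warmup_ratio": 0.00046632724074153857,
}


def to_training_kwargs(selected):
    """Rename `epochs` to `num_train_epochs` (assumed mapping) and keep the rest."""
    kwargs = dict(selected)  # copy so the input dict is left unchanged
    kwargs["num_train_epochs"] = int(kwargs.pop("epochs"))
    return kwargs


training_kwargs = to_training_kwargs(selected)
print(training_kwargs)

# Possible use (requires transformers; the output directory is a placeholder):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="bioner_out", **training_kwargs)
```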