Model card for OpenBioNER

We introduce OpenBioNER, a lightweight BERT-based model tailored for open-domain Biomedical NER. This model can find unseen target entity types based solely on their natural language descriptions, eliminating the need for retraining.

OpenBioNER is pretrained on synthetic silver annotations generated through LLM self-supervision. Extensive experiments demonstrate that OpenBioNER outperforms specialized LLMs, such as UniNER and GPT-4o, achieving an F1 score improvement of up to 10% in zero-shot settings across various biomedical benchmarks. In comparison to smaller baselines such as GLiNER, our model achieves better performance while using up to 4x fewer parameters.

Links

Installation

To use this model, you must install the IBM Zshot library (from main branch before next release):

!pip install -U zshot==0.0.11 datasets gliner
!python -m spacy download en_core_web_sm

Usage

import spacy

from zshot import PipelineConfig, displacy
from zshot.linker import LinkerSMXM
from zshot.utils.data_models import Entity

# Only needed to run benchmark evaluations; unused in the snippet below
from zshot.evaluation.metrics._seqeval._seqeval import Seqeval
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report

# Define your list of candidate entity types
entities = [
    Entity(name='BACTERIUM', description='A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus.', vocabulary=None),
]

nlp = spacy.blank("en")
nlp_config = PipelineConfig(
    linker=LinkerSMXM(model_name="disi-unibo-nlp/openbioner-base"),
    entities=entities,
    device='cuda' # or 'cpu' if GPU not available
)
nlp.add_pipe("zshot", config=nlp_config, last=True)


sentence = "Impact of cofactor - binding loop mutations on thermotolerance and activity of E. coli transketolase"
doc = nlp(sentence)

displacy.render(doc, style="ent")

Performance

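OpenBioNER achieves the highest average score across the eight benchmarks (52.9, or 53.0 when run through Zshot), ahead of GLiNER_large-v1 (51.9), UniNER (49.9), and GPT-4o (43.3). It leads on BC2GM, BC4CHEMD, and JNLPBA-R and ties GLiNER_large-v1 on JNLPBA, while other models score higher on AnatEM, NCBI, BC5CDR, and MedMentions-R.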

| Model | Size | AnatEM | NCBI | JNLPBA | BC2GM | BC4CHEMD | BC5CDR | JNLPBA-R | MedMentions-R | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | - | 38.7 | 50.0 | 41.9 | 37.3 | 36.4 | 66.4 | 26.6 | 49.1 | 43.3 |
| UniNER | 7B | 25.1 | 60.4 | 48.1 | 46.2 | 47.9 | 68.0 | 50.2 | 53.4 | 49.9 |
| GLiNER_large-v1 | 459M | 33.3 | 61.9 | 57.1 | 47.9 | 43.1 | 66.4 | 51.9 | 53.4 | 51.9 |
| OpenBioNER (Ours) | 110M | 35.2 | 58.5 | 57.1 | 49.1 | 48.0 | 60.4 | 63.9 | 50.9 | 52.9 |
| OpenBioNER (Ours) - Zshot | 110M | 34.8 | 57.8 | 56.8 | 49.5 | 47.1 | 60.1 | 64.6 | 52.9 | 53.0 |
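
The AVG column can be re-derived from the per-dataset scores. The sketch below assumes AVG is an unweighted mean over the eight datasets, rounded half-up to one decimal; that rounding rule is an assumption, although it reproduces every value in the table. The function name `macro_average` is illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset scores copied from the table above (AVG column excluded).
scores = {
    "GPT-4o": ["38.7", "50.0", "41.9", "37.3", "36.4", "66.4", "26.6", "49.1"],
    "UniNER": ["25.1", "60.4", "48.1", "46.2", "47.9", "68.0", "50.2", "53.4"],
    "GLiNER_large-v1": ["33.3", "61.9", "57.1", "47.9", "43.1", "66.4", "51.9", "53.4"],
    "OpenBioNER": ["35.2", "58.5", "57.1", "49.1", "48.0", "60.4", "63.9", "50.9"],
    "OpenBioNER - Zshot": ["34.8", "57.8", "56.8", "49.5", "47.1", "60.1", "64.6", "52.9"],
}


def macro_average(values):
    """Unweighted mean over datasets, rounded half-up to one decimal (assumed rule)."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {model: macro_average(vals) for model, vals in scores.items()}
best_model = max(averages, key=averages.get)

for model, avg in averages.items():
    print(f"{model}: {avg}")
```

Decimal arithmetic is used instead of floats so that exact ties such as 51.875 round the same way on every platform.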

⚠️ Disclaimer: Please note that running evaluations using the zshot library may lead to slightly different results on certain benchmarks compared to those reported in the paper (above). This discrepancy is due to differences in token alignment: zshot uses spaCy's character-based span matching, while our experiments use token-level alignment as handled by BERT-based NER pipelines. These differences can affect how entity spans are matched and evaluated, particularly in cases with subword tokenization or punctuation.
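
To make the alignment issue concrete, here is a toy illustration of how one character span can produce different token-level tags depending on the matching rule. It is not the code used by zshot or in the paper: the whitespace tokenizer, the `strict` flag, and both helper functions are assumptions made for this example only.

```python
def whitespace_tokens(text):
    """Return (token, start, end) triples using plain whitespace splitting."""
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((tok, start, end))
        pos = end
    return tokens


def char_span_to_bio(tokens, span, label, strict=True):
    """Project one character span (start, end) onto token-level BIO tags.

    strict=True keeps the span only if both ends fall on token boundaries;
    strict=False tags every token that overlaps the span.
    """
    start, end = span
    tags = ["O"] * len(tokens)
    covered = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
    if not covered:
        return tags
    aligned = tokens[covered[0]][1] == start and tokens[covered[-1]][2] == end
    if strict and not aligned:
        return tags  # span dropped: it does not line up with token boundaries
    tags[covered[0]] = f"B-{label}"
    for i in covered[1:]:
        tags[i] = f"I-{label}"
    return tags


text = "Mutations in E. coli transketolase."
tokens = whitespace_tokens(text)

# This span starts and ends exactly on token boundaries.
begin = text.index("E. coli")
aligned_tags = char_span_to_bio(tokens, (begin, begin + len("E. coli")), "BACTERIUM")

# This span stops before the trailing period, so it cuts through the last token.
begin = text.index("transketolase")
cut = (begin, begin + len("transketolase"))
strict_tags = char_span_to_bio(tokens, cut, "PROTEIN", strict=True)
loose_tags = char_span_to_bio(tokens, cut, "PROTEIN", strict=False)
```

Under the strict rule the second mention is lost entirely, while the overlap rule keeps it. Small differences of this kind around punctuation are enough to shift span-level scores by a few tenths of a point.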

Descriptions

Below we provide all the descriptions used to evaluate OpenBioNER for each dataset.


Negative Class

This description is used as the NEG class (i.e., not an entity) for all datasets except MedMentions-Rare:

Coal, water, oil, etc. are normally used for traditional electricity generation. However using liquefied natural gas as fuel for joint circulatory electricity generation has advantages. The chief financial officer is the only one there taking the fall. It has a very talented team, eh. What will happen to the wildlife? I just tell them, you've got to change. They're here to stay. They have no insurance on their cars. What else would you like? Whether holding an international cultural event or setting the city's cultural policies, she always asks for the participation or input of other cities and counties.


NCBI

TYPE Description
DISEASE A disease is a medical condition that disrupts normal bodily functions or structures, affecting various organs or systems, and leading to symptoms like muscle weakness, fatigue, stiffness, or cognitive impairment. Diseases can impact muscles, the nervous system, heart, eyes, and more, and may be chronic or acute, such as diabetes, cardiovascular or neurological disorders, and cancer-related conditions like lymphoblastic leukemia or lymphoma.

AnatEM

TYPE Description
ANATOMY The anatomy refers to biological components at various scales, including cells, tissues, and organs. These entities can be identified by proper nouns referring to cell types (e.g., HeLa cells, neurospheres, NSCLC, SCC), body parts (e.g., serum, blood) or biological substances (e.g., vegetables, meats, cow milk) or tumors.

BC4CHEMD

TYPE Description
CHEMICAL Chemicals are substances that are composed of one or more elements, typically consisting of atoms bonded together by chemical bonds. They can be naturally occurring, such as vitamins or sterols, or synthesized, like alkylcarbazoles or tetrachlorodibenzo-p-dioxins (TCDD). Chemicals can also be modified or combined to form new compounds, such as esters or polymers.

BC2GM

TYPE Description
GENE A gene is a unit of heredity that carries information from one generation to the next and is composed of DNA sequences that encode the instructions for the development, growth, and function of an organism. It can be a segment of DNA that is passed from one generation to the next and is responsible for the transmission of traits from parents to offspring. A gene is often represented using a three-letter code (e.g., trios, ABL, DNA-PK).

BC5CDR

TYPE Description
CHEMICAL Chemicals are substances that are composed of atoms, either bonded together in a molecule or as a mixture of different substances. This includes medications (e.g., nitroarginine methyl ester, nifedipine, prednisolone, methyldopa), compounds (e.g., potassium, calcium, ammonium), and other substances that can have various effects on the body.
DISEASE Diseases are any medical condition that affects the normal functioning of the body, resulting in symptoms, discomfort, or potentially life-threatening complications. This includes chronic and acute disorders, conditions affecting specific bodily systems, cancer-related conditions, and complications arising from medical treatments or external factors.

JNLPBA

TYPE Description
PROTEIN A protein is a large biomolecule composed of one or more chains of amino acids, essential for structure and function within cells. Proteins serve as enzymes, receptors, and signaling molecules, playing critical roles in hormone action, immune response, and cellular communication.
DNA DNA refers to a molecule that contains the genetic instructions used in the development and function of all living organisms. It is composed of two strands of nucleotides that are coiled together in a double helix structure.
CELL_TYPE A cell type refers to a specific category of cells defined by characteristic morphology, function, and molecular markers. Examples include lymphocytes, leukocytes, mononuclear cells, polymorphonuclear leukocytes, and B-lymphoblastoid cells.
CELL_LINE A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments.
RNA RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins.

JNLPBA-Rare

TYPE Description
CELL_LINE A cell line is a population of cells derived from a single cell, cultured in vitro or in vivo. It can be normal or transformed, with genetic changes like mutations. Cell lines, such as B-cells or HeLa cells, are used in research to study cellular processes, model diseases, and develop treatments.
RNA RNA is a type of nucleic acid that plays a crucial role in the transmission of genetic information from DNA to proteins. It is a single-stranded molecule composed of nucleotides, and its primary function is to carry genetic information from the nucleus to the ribosomes, where it is translated into proteins.

MedMentions-Rare

TYPE Description
NEG In this study, we fabricated prevascularized synthetic device ports to help mitigate this limitation. Thus, the optimum range of pore size for prevascularization of these membranes was estimated to be 75 - 100 μm. A total of 51 patients were included, 16 in group I and 35 in group II.
Bacterium (T007) A bacterium refers to a type of microorganism that can exist as a single cell and may cause infections or play a role in various biological processes. Examples include species like Streptococcus pneumoniae and Streptomyces ahygroscopicus.
Body Substance (T031) A body substance is any material produced by or found within the body, such as blood, serum, saliva, sweat, or gastric acid. Specific examples include serum cytokine levels for immune responses, blood lipids for metabolic studies, and hemolymph glucose for stress responses.
Food (T168) A food refers to any substance consumed to provide nutritional support for the body. This includes a wide range of items such as snacks, meat, dairy products, grains like wheat, and edible substances like carbohydrates, proteins, and fats.
Body System (T022) A body system consists of interconnected organs and tissues working together to carry out essential functions. Examples include the gastrointestinal tract for digestion, the nervous system for sensory and motor control, the hematological system for blood-related functions, and the endocrine system for hormone regulation.
Professional or Occupational Group (T097) A professional refers to individuals who share the same profession, occupation, or role within a specific field. Examples include cardiologists, psychologists, assessors, hospice staff, and volunteers.

🧬 How to Write Effective Entity Type Descriptions

Entity type descriptions are crucial for improving generalization in OpenBioNER. Well-written descriptions help models disambiguate types, handle rare classes, and align with real-world usage across diverse datasets.

✅ Best Practices

  • Start with a clear definition: Briefly explain what the entity type is.

  • Include functions or context: Add what it does, its purpose, or where it appears.

  • List 3–5 concrete examples: Use domain-relevant examples (e.g., real diseases, proteins, or food items).

  • Mention subtypes or synonyms (optional): Helps capture lexical variation and rare mentions.

  • Keep it concise: 1–3 well-structured sentences are ideal.

⚠️ Common Mistakes to Avoid

  • Vague or overly generic descriptions
  • No examples
  • Just a list of terms
  • Redundant or circular wording
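
The guidelines above can be turned into a rough automatic check. The sketch below is a set of hypothetical heuristics, not part of OpenBioNER or zshot: the cue phrases, thresholds, and warning codes are all assumptions, and the naive sentence splitter will miscount text containing abbreviations followed by a space (for example "E. coli").

```python
import re

# Assumed cue phrases; extend them to suit your own descriptions.
EXAMPLE_CUES = ("e.g.", "such as", "examples include", "for example", "including")
DEFINITION_CUES = (" is ", " are ", " refers to ", " consists of ")


def check_description(description, min_words=8, max_sentences=3):
    """Return heuristic warning codes for one entity type description."""
    text = " ".join(description.split())
    lowered = f" {text.lower()} "
    warnings = []
    if len(text.split()) < min_words:
        warnings.append("too_short")  # likely vague or overly generic
    if not any(cue in lowered for cue in DEFINITION_CUES):
        warnings.append("no_definition")  # may be just a list of terms
    if not any(cue in lowered for cue in EXAMPLE_CUES):
        warnings.append("no_examples")
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if len(sentences) > max_sentences:
        warnings.append("too_long")
    return warnings


good = (
    "A bacterium refers to a type of microorganism that can exist as a single "
    "cell and may cause infections or play a role in various biological "
    "processes. Examples include species like Streptococcus pneumoniae and "
    "Streptomyces ahygroscopicus."
)
bad = "Streptococcus, Staphylococcus, Escherichia"

good_warnings = check_description(good)
bad_warnings = check_description(bad)
```

An empty list means no heuristic fired; it does not guarantee that the description will work well in practice.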

🧪 Template (Recommended Format)

A [TYPE] refers to [concise definition]. It includes examples such as [example1], [example2], and [example3].
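
As a minimal sketch, the template can be filled programmatically. The helper `build_description` is hypothetical (it is not provided by zshot), and requiring at least three examples is an assumption taken from the best practices listed earlier.

```python
def build_description(type_name, definition, examples):
    """Fill the recommended template with a definition and a list of examples."""
    if len(examples) < 3:
        raise ValueError("provide at least three concrete examples")
    article = "An" if type_name[0].lower() in "aeiou" else "A"
    *head, last = examples
    example_text = ", ".join(head) + f", and {last}"
    return (
        f"{article} {type_name} refers to {definition}. "
        f"It includes examples such as {example_text}."
    )


description = build_description(
    "food",
    "any substance consumed to provide nutritional support for the body",
    ["snacks", "meat", "dairy products"],
)
print(description)
```

The resulting string can be passed as the `description` argument of an `Entity`, as in the Usage section.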

Authors

📬 Contacts

For questions, collaborations, or feedback, feel free to reach out:
