---
license: mit
datasets:
- ncbi/pubmed
language:
- en
tags:
- biomedical-text
- nlp
- biomedical-nlp
- discharge-notes
- healthcare
- pubmed
pipeline_tag: feature-extraction
base_model:
- answerdotai/ModernBERT-base
---

# Clinical ModernBERT

Clinical ModernBERT is a state-of-the-art encoder-based transformer tailored specifically for biomedical and clinical text, with a context length of up to **8,192 tokens**. Building on the innovations introduced by ModernBERT, the model extends the context window and incorporates domain-specific vocabulary refinements. It is designed to produce semantically rich representations that capture both the nuanced syntax of biomedical literature and the intricate semantics of clinical narratives.

## Usage

Pretrained model weights and tokenizer artifacts are provided to facilitate easy integration with downstream biomedical NLP tasks (a fuller feature-extraction sketch follows the model overview below):

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
```

## Model Overview

The table below summarizes ModernBERT's key architectural components and their benefits:

| **Feature** | **Description** | **Benefit** |
|------------------------------|--------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| Extended Context Length | Processes sequences up to **8,192 tokens**. | Captures long-range dependencies and full document contexts, essential for complex linguistic tasks. |
| GeGLU Activation | Uses the GeGLU activation, a gated variant of GeLU. | Enhances non-linear representation and model stability by allowing controlled information flow. |
| Rotary Positional Embeddings | Implements RoPE to encode relative positional information. | Provides robust handling of positional data, especially beneficial for extended contexts. |
| Flash Attention | Employs Flash Attention to compute self-attention blockwise. | Reduces memory overhead from quadratic to near-linear complexity, enabling efficient processing of long sequences. |

This model leverages a suite of modern architectural advancements, including rotary positional embeddings (RoPE), Flash Attention for near-linear memory usage with extended contexts, and GeGLU activation layers that enhance representational capacity through smooth gating mechanisms. By initializing from a ModernBERT-base checkpoint and applying domain-specific pre-training on approximately 40 million PubMed abstracts combined with MIMIC-IV clinical notes, Clinical ModernBERT is optimized for tasks such as retrieval-augmented generation, fine-grained text classification, and domain-specific entity extraction.
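As a concrete illustration of the feature-extraction use case, the sketch below mean-pools the final hidden states over non-padding tokens to obtain a single note-level embedding. The pooling strategy and the example note are illustrative assumptions, not a prescription from the paper or the released code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
model.eval()

# Hypothetical clinical snippet; real notes can run to thousands of tokens,
# which the 8,192-token context window is designed to accommodate.
note = "Patient admitted with DKA. Started on insulin infusion; anion gap closed overnight."

inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state over non-padding tokens (one reasonable
# pooling choice among several; the card does not prescribe a specific one).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, hidden_size])
```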
## Pre-training Optimizations

| **Parameter** | **Value** | **Description** |
|--------------------------|--------------------------------------------------|---------------------------------------------------------------------|
| Total Tokens | 13,004,002,816 | Total number of tokens in the unified pre-training corpus |
| Pre-training Corpus | PubMed + MIMIC-IV + Medical Codes & Descriptions | Approximately 40M PubMed abstracts, MIMIC-IV clinical notes, and medical code–description pairs (e.g., ICD-9 code 250.00: "Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled") |
| Training Steps | 150,000 | Total number of masked language modeling (MLM) training steps |
| Batch Size | 128 | Batch size used during training |

## Masked Language Modeling (MLM) Setup

Clinical ModernBERT is pre-trained using a multi-phase masked language modeling (MLM) strategy. A custom collator dynamically adjusts the masking probability, beginning at 30% and decreasing to 15% over the course of training, to emphasize medically relevant tokens (e.g., drug names, procedural codes); an illustrative sketch of such a schedule appears at the end of this card. The MLM objective is defined as

$$
\mathcal{L}_{\text{MLM}} = - \sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \tilde{x}),
$$

where \(\mathcal{M}\) is the set of masked positions and the model predicts each masked token given its corrupted context \(\tilde{x}\). The table below reports masked-token recovery accuracy at several top-\(k\) cutoffs:

| **Metric** | **Top-1 Accuracy** | **Top-5 Accuracy** | **Top-10 Accuracy** | **Top-25 Accuracy** |
|------------------|--------------------|--------------------|---------------------|---------------------|
| **Value (%)** | 63.31 | 79.67 | 83.33 | 88.10 |

Higher top-\(k\) values reflect broader lexical recall; across all cutoffs the model consistently ranks clinically appropriate tokens among its top predictions.

## Intended Use

Clinical ModernBERT is ideally suited for tasks that demand an in-depth understanding of biomedical language. It is particularly valuable for clinical information retrieval, narrative classification, and structured medical coding. Researchers and practitioners may fine-tune this model for specialized downstream applications such as electronic health record analysis, clinical decision support systems, and evidence-based medical literature retrieval.

## Citations and Pre-training Source Code

The source code can be found here: [Clinical ModernBERT Github](https://github.com/Simonlee711/Clinical_ModernBERT)

Citing Model

```
@misc{simon_lee_2025,
  author    = { Simon Lee },
  title     = { Clinical_ModernBERT (Revision 24e72d6) },
  year      = 2025,
  url       = { https://huggingface.co/Simonlee711/Clinical_ModernBERT },
  doi       = { 10.57967/hf/4999 },
  publisher = { Hugging Face }
}
```

Citing Paper

```
@misc{lee2025clinicalmodernbertefficientlong,
  title         = {Clinical ModernBERT: An efficient and long context encoder for biomedical text},
  author        = {Simon A. Lee and Anthony Wu and Jeffrey N. Chiang},
  year          = {2025},
  eprint        = {2504.03964},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2504.03964},
}
```

## Questions

For questions, email simonlee711@g.ucla.edu.
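## Appendix: Masking Schedule Sketch

The dynamic masking schedule described in the MLM setup above can be approximated by subclassing `DataCollatorForLanguageModeling` and annealing `mlm_probability` linearly from 30% to 15%. The sketch below is an illustrative re-implementation under that assumption; it covers only the probability schedule, not the emphasis on medically relevant tokens, and the class name and `set_step` hook are hypothetical rather than part of the released training code.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

class ScheduledMLMCollator(DataCollatorForLanguageModeling):
    """Linearly anneals the masking probability from `start_p` to `end_p`.

    Hypothetical helper: the training loop is expected to call `set_step`
    once per optimizer step so masking decays over the 150,000-step run.
    """

    def __init__(self, tokenizer, total_steps, start_p=0.30, end_p=0.15):
        super().__init__(tokenizer=tokenizer, mlm=True, mlm_probability=start_p)
        self.total_steps = total_steps
        self.start_p = start_p
        self.end_p = end_p

    def set_step(self, step: int) -> None:
        # Interpolate between the initial and final masking probabilities.
        frac = min(step / self.total_steps, 1.0)
        self.mlm_probability = self.start_p + frac * (self.end_p - self.start_p)


tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
collator = ScheduledMLMCollator(tokenizer, total_steps=150_000)
collator.set_step(75_000)        # halfway through training
print(collator.mlm_probability)  # 0.225, between the 0.30 start and 0.15 end
```

In a `Trainer`-based setup, a callback could invoke `set_step` at each step; the point of the sketch is simply that the collator reads `mlm_probability` at collation time, so updating it between batches changes how aggressively tokens are masked.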