---
license: mit
datasets:
- ncbi/pubmed
language:
- en
tags:
- biomedical-text
- nlp
- biomedical-nlp
- discharge-notes
- healthcare
- pubmed
pipeline_tag: feature-extraction
base_model:
- answerdotai/ModernBERT-base
---

# Clinical ModernBERT

Clinical ModernBERT is a state-of-the-art encoder-based transformer tailored specifically for biomedical and clinical text, with a context length of up to **8,192 tokens**. Building on the innovations introduced by ModernBERT, the model extends the context window and incorporates domain-specific vocabulary refinements. It is designed to produce semantically rich representations that capture both the nuanced syntax of biomedical literature and the intricate semantics of clinical narratives.

## Usage

Pretrained model weights and tokenizer artifacts are provided to facilitate easy integration with downstream biomedical NLP tasks (a fuller feature-extraction sketch follows the model overview below):

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
```

## Model Overview

The table below summarizes ModernBERT's key architectural components and their benefits:

| **Feature** | **Description** | **Benefit** |
|------------------------------|--------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
| Extended Context Length | Processes sequences up to **8,192 tokens**. | Captures long-range dependencies and full document contexts, essential for complex linguistic tasks. |
| GeGLU Activation | Uses the GeGLU activation, a gated variant of GeLU. | Enhances non-linear representation and model stability by allowing controlled information flow. |
| Rotary Positional Embeddings | Implements RoPE to encode relative positional information. | Provides robust handling of positional data, especially beneficial for extended contexts. |
| Flash Attention | Employs Flash Attention to compute self-attention blockwise. | Reduces memory overhead from quadratic to near-linear complexity, enabling efficient processing of long sequences. |

This model leverages a suite of modern architectural advancements, including rotary positional embeddings (RoPE), Flash Attention for near-linear memory usage with extended contexts, and GeGLU activation layers that enhance representational capacity through smooth gating mechanisms. By initializing from a ModernBERT-base checkpoint and applying domain-specific pre-training on approximately 40 million PubMed abstracts combined with MIMIC-IV clinical notes, Clinical ModernBERT is optimized for tasks such as retrieval-augmented generation, fine-grained text classification, and domain-specific entity extraction.
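As a concrete illustration of the feature-extraction use case, the sketch below mean-pools the final hidden states over non-padding tokens to obtain a single note-level embedding. The pooling strategy and the example note are illustrative assumptions, not a prescription from the paper or the released code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('Simonlee711/Clinical_ModernBERT')
tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
model.eval()

# Hypothetical clinical snippet; real notes can run to thousands of tokens,
# which the 8,192-token context window is designed to accommodate.
note = "Patient admitted with DKA. Started on insulin infusion; anion gap closed overnight."

inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state over non-padding tokens (one reasonable
# pooling choice among several; the card does not prescribe a specific one).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, hidden_size])
```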
## Pre-training Optimizations

| **Parameter** | **Value** | **Description** |
|--------------------------|--------------------------------------------------|---------------------------------------------------------------------|
| Total Tokens | 13,004,002,816 | Total number of tokens in the unified pre-training corpus |
| Pre-training Corpus | PubMed + MIMIC-IV + Medical Codes & Descriptions | Approximately 40M PubMed abstracts, MIMIC-IV clinical notes, and medical code–description pairs (e.g., ICD-9 code 250.00: "Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled") |
| Training Steps | 150,000 | Total number of masked language modeling (MLM) training steps |
| Batch Size | 128 | Batch size used during training |

## Masked Language Modeling (MLM) Setup

Clinical ModernBERT is pre-trained using a multi-phase masked language modeling (MLM) strategy. A custom collator dynamically adjusts the masking probability, beginning at 30% and decreasing to 15% over the course of training, to emphasize medically relevant tokens (e.g., drug names, procedural codes); an illustrative sketch of such a schedule appears at the end of this card. The MLM objective is defined as

$$
\mathcal{L}_{\text{MLM}} = - \sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid \tilde{x}),
$$

where \(\mathcal{M}\) is the set of masked positions and the model predicts each masked token given its corrupted context \(\tilde{x}\). The table below reports masked-token recovery accuracy at several top-\(k\) cutoffs:

| **Metric** | **Top-1 Accuracy** | **Top-5 Accuracy** | **Top-10 Accuracy** | **Top-25 Accuracy** |
|------------------|--------------------|--------------------|---------------------|---------------------|
| **Value (%)** | 63.31 | 79.67 | 83.33 | 88.10 |

Higher top-\(k\) values reflect broader lexical recall; across all cutoffs the model consistently ranks clinically appropriate tokens among its top predictions.

## Intended Use

Clinical ModernBERT is ideally suited for tasks that demand an in-depth understanding of biomedical language. It is particularly valuable for clinical information retrieval, narrative classification, and structured medical coding. Researchers and practitioners may fine-tune this model for specialized downstream applications such as electronic health record analysis, clinical decision support systems, and evidence-based medical literature retrieval.

## Citations and Pre-training Source Code

The source code can be found here: [Clinical ModernBERT Github](https://github.com/Simonlee711/Clinical_ModernBERT)

Citing Model

```
@misc{simon_lee_2025,
  author    = { Simon Lee },
  title     = { Clinical_ModernBERT (Revision 24e72d6) },
  year      = 2025,
  url       = { https://huggingface.co/Simonlee711/Clinical_ModernBERT },
  doi       = { 10.57967/hf/4999 },
  publisher = { Hugging Face }
}
```

Citing Paper

```
@misc{lee2025clinicalmodernbertefficientlong,
  title         = {Clinical ModernBERT: An efficient and long context encoder for biomedical text},
  author        = {Simon A. Lee and Anthony Wu and Jeffrey N. Chiang},
  year          = {2025},
  eprint        = {2504.03964},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2504.03964},
}
```

## Questions

For questions, email simonlee711@g.ucla.edu.
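## Appendix: Masking Schedule Sketch

The dynamic masking schedule described in the MLM setup above can be approximated by subclassing `DataCollatorForLanguageModeling` and annealing `mlm_probability` linearly from 30% to 15%. The sketch below is an illustrative re-implementation under that assumption; it covers only the probability schedule, not the emphasis on medically relevant tokens, and the class name and `set_step` hook are hypothetical rather than part of the released training code.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

class ScheduledMLMCollator(DataCollatorForLanguageModeling):
    """Linearly anneals the masking probability from `start_p` to `end_p`.

    Hypothetical helper: the training loop is expected to call `set_step`
    once per optimizer step so masking decays over the 150,000-step run.
    """

    def __init__(self, tokenizer, total_steps, start_p=0.30, end_p=0.15):
        super().__init__(tokenizer=tokenizer, mlm=True, mlm_probability=start_p)
        self.total_steps = total_steps
        self.start_p = start_p
        self.end_p = end_p

    def set_step(self, step: int) -> None:
        # Interpolate between the initial and final masking probabilities.
        frac = min(step / self.total_steps, 1.0)
        self.mlm_probability = self.start_p + frac * (self.end_p - self.start_p)


tokenizer = AutoTokenizer.from_pretrained('Simonlee711/Clinical_ModernBERT')
collator = ScheduledMLMCollator(tokenizer, total_steps=150_000)
collator.set_step(75_000)        # halfway through training
print(collator.mlm_probability)  # 0.225, between the 0.30 start and 0.15 end
```

In a `Trainer`-based setup, a callback could invoke `set_step` at each step; the point of the sketch is simply that the collator reads `mlm_probability` at collation time, so updating it between batches changes how aggressively tokens are masked.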