Medical Jargon Identifier with CRF

A PyTorch model that performs fine-grained medical jargon identification using a RoBERTa-large backbone enhanced by a Conditional Random Field (CRF) layer.
Fine-tuned on the MedReadMe dataset introduced by Jiang & Xu (2024).


🧠 Overview

  • Architecture: RoBERTa-large → Linear classifier → CRF
  • Task: Token-level classification into 7 medical jargon categories + BIO tagging
  • Input: Raw English text (sentences or paragraphs)
  • Output: Word-level spans labeled with jargon type and boundaries
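
To make the role of the CRF layer concrete, here is a minimal pure-Python sketch of Viterbi decoding, the usual way a linear-chain CRF picks its best tag sequence. It is an illustration under stated assumptions, not the code in modeling_jargon.py: the three-tag set, the scores, and the function name viterbi_decode are all made up for the example, and the start/end transition scores that real CRF layers typically include are omitted.

```python
from typing import List

def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]   : score of tag j at position t (e.g. linear classifier output)
    transitions[i][j] : score of moving from tag i to tag j (learned by the CRF)
    """
    num_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag so far
    backpointers = []
    for emission in emissions[1:]:
        step_ptrs, next_score = [], []
        for j in range(num_tags):
            # Best previous tag i for landing on tag j at this position
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            step_ptrs.append(best_i)
            next_score.append(score[best_i] + transitions[best_i][j] + emission[j])
        backpointers.append(step_ptrs)
        score = next_score
    # Trace back from the best final tag
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for step_ptrs in reversed(backpointers):
        best = step_ptrs[best]
        path.append(best)
    return path[::-1]

# Hypothetical tag ids: 0 = O, 1 = B, 2 = I. The O -> I transition is penalised.
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
emissions = [[1.0, 0.8, 0.0],   # per-token argmax would pick O
             [0.0, 0.2, 1.0]]   # per-token argmax would pick I
print(viterbi_decode(emissions, transitions))
```

Taking the argmax at each token independently would yield O followed by I, which is not a valid BIO sequence; the transition penalty steers the decoder to B followed by I instead.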

🎯 Supported Jargon Categories

Label (BIO)                   Meaning
medical-jargon-google-easy    Easily Google-able medical terms
medical-jargon-google-hard    Complex, hard-to-Google medical terms
medical-name-entity           Named diseases, drugs, procedures
general-complex               Complex general vocabulary
abbr-medical                  Medical abbreviations (e.g., ECG, CBC)
abbr-general                  General abbreviations
general-medical-multisense    Words with both lay and medical meanings
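
As a rough illustration of how seven categories combine with BIO tagging, the sketch below builds a 15-label tag set (one O tag plus a B- and an I- tag per category). The ordering and the resulting ids are assumptions made for this example; the authoritative label map is the one stored in config.json.

```python
CATEGORIES = [
    "medical-jargon-google-easy",
    "medical-jargon-google-hard",
    "medical-name-entity",
    "general-complex",
    "abbr-medical",
    "abbr-general",
    "general-medical-multisense",
]

# "O" marks tokens outside any span; every category gets a
# B- (begin) tag and an I- (inside) tag.
labels = ["O"] + [f"{prefix}-{name}" for name in CATEGORIES for prefix in ("B", "I")]

# Hypothetical id assignment; the real mapping lives in config.json.
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))
```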

πŸ“ Files & Format

  • pytorch_model.bin – model weights
  • config.json – hyper-parameters & label map
  • tokenizer.json, vocab.json, merges.txt – RoBERTa tokenizer assets
  • modeling_jargon.py – custom CRFTokenClassificationModel class
  • requirements.txt – runtime dependencies

🔧 Quick Start

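# modeling_jargon.py ships with the model repository and must be importable
# (e.g. placed in the working directory) before running this script.
from transformers import AutoTokenizer
from modeling_jargon import CRFTokenClassificationModel
import torch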

# 1. Load model and tokenizer
model_name = "DNivalis/med-jargon-crf"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = CRFTokenClassificationModel.from_pretrained(model_name)
model.eval()

# 2. Prepare input text
text = "The patient presented with elevated CRP and intermittent AF."
inputs = tokenizer(text, return_tensors="pt")

# 3. Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs["logits"]
    # Decode best sequence using CRF
    predicted_tags = model.decode(logits, inputs["attention_mask"])[0]

# 4. Group BIO tags into spans (end index exclusive).
#    Assumes model.id2label maps tag ids to "O", "B-<category>" or "I-<category>".
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.id2label[tag_id] for tag_id in predicted_tags]
spans, start, category = [], None, None
for i, label in enumerate(labels + ["O"]):  # trailing "O" closes a span at the end
    prefix, _, name = label.partition("-")
    if start is not None and not (prefix == "I" and name == category):
        spans.append((start, i, category))  # current span ends before token i
        start = None
    if start is None and prefix in ("B", "I"):
        start, category = i, name  # open a new span (a stray "I-" tag also opens one)

# 5. Display results
print("Detected medical jargon:")
for start, end, category in spans:
    # Convert the span's tokens back to text
    detected_text = tokenizer.convert_tokens_to_string(tokens[start:end])
    print(f"{category}: '{detected_text.strip()}'")
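
The model predicts one tag per subword token, so obtaining word-level output requires merging subwords. The sketch below shows one possible approach, assuming RoBERTa's byte-level BPE convention in which a token that starts a new whitespace-delimited word carries a leading "Ġ". The helper name to_word_level, the token split shown, and the choice to keep the first subword's label are all assumptions made for illustration; plain token concatenation is also only reliable for ASCII text.

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}

def to_word_level(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge subword tokens into words, keeping the first subword's label."""
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # sentence boundary and padding tokens carry no text
        if token.startswith("Ġ") or not words:
            # "Ġ" marks the start of a new word
            words.append((token.lstrip("Ġ"), label))
        else:
            # Continuation piece: append its text to the current word
            text, first_label = words[-1]
            words[-1] = (text + token, first_label)
    return words

# Hand-written tokens and labels for illustration (not actual model output)
example_tokens = ["<s>", "Ġelevated", "ĠCR", "P", "</s>"]
example_labels = ["O", "O", "B-abbr-medical", "I-abbr-medical", "O"]
print(to_word_level(example_tokens, example_labels))
```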

πŸ₯ Supported Tasks

  • Medical jargon detection – binary, 3-class, or 7-category granularity
  • Named-entity recognition – extract spans of medical interest
  • Readability analysis – density of jargon per sentence or document
  • Downstream QA & summarization – filter or simplify complex terms
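
Two of the tasks above, binary detection and jargon density, reduce to simple post-processing of predicted spans. The following sketch assumes spans are represented as (start, end, category) tuples with an exclusive end index; that format and the helper names to_binary and jargon_density are choices made for this example, not part of the model's API.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, category), end exclusive

def to_binary(spans: List[Span]) -> List[Span]:
    """Collapse every fine-grained category into a single 'jargon' label."""
    return [(start, end, "jargon") for start, end, _ in spans]

def jargon_density(spans: List[Span], num_tokens: int) -> float:
    """Fraction of tokens covered by at least one jargon span."""
    if num_tokens == 0:
        return 0.0
    covered = set()
    for start, end, _ in spans:
        covered.update(range(start, end))  # overlapping spans are counted once
    return len(covered) / num_tokens

# Hand-written spans for a 10-token sentence (illustrative, not model output)
example_spans = [(5, 6, "abbr-medical"), (8, 9, "abbr-medical")]
print(to_binary(example_spans))
print(jargon_density(example_spans, num_tokens=10))
```

Whether density is computed over subword tokens or whole words changes the number, so the unit should be fixed before comparing documents.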

🌍 Language

English only.


📚 Training Data

Fine-tuned on MedReadMe: 4,520 sentences with fine-grained jargon span annotations, including the novel Google-Easy and Google-Hard categories.


📖 Citation

If you use this model or the underlying dataset, please cite:

@article{jiang2024medreadmesystematicstudyfinegrained,
  title={MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain},
  author={Chao Jiang and Wei Xu},
  year={2024},
  eprint={2405.02144},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2405.02144}
}

πŸ“ License & Usage

Licensed under Apache 2.0.

  • ✅ Allowed: research, commercial use, derivative works
  • Include license notice and attribution in any distribution

⚠️ Important Notes

  • Model outputs are not medical advice; use for research/educational purposes only.
  • Performance may vary on text that differs substantially from the MedReadMe training domain.
  • Consider additional post-processing for production systems (e.g., confidence filtering).
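
One possible form of confidence filtering is sketched below. It scores each span by the mean softmax probability of its predicted tags, computed from the per-token emission logits. This is only a rough proxy: it ignores the CRF transition scores, and CRF marginal probabilities would be a more principled choice. The function name filter_spans, the (start, end, category) span format, and the 0.5 default threshold are all assumptions for illustration.

```python
import math
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, category), end exclusive

def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def filter_spans(spans: List[Span], tags: List[int],
                 logits: List[List[float]], threshold: float = 0.5) -> List[Span]:
    """Keep spans whose mean probability of the predicted tag is >= threshold."""
    kept = []
    for start, end, category in spans:
        probs = [softmax(logits[t])[tags[t]] for t in range(start, end)]
        if probs and sum(probs) / len(probs) >= threshold:
            kept.append((start, end, category))
    return kept

# Toy two-tag example (illustrative values, not model output)
toy_logits = [[0.0, 4.0], [1.0, 1.2]]
toy_tags = [1, 1]
toy_spans = [(0, 1, "abbr-medical"), (1, 2, "general-complex")]
print(filter_spans(toy_spans, toy_tags, toy_logits, threshold=0.6))
```

Here the first span is predicted with high probability and survives, while the second sits close to a coin flip and is dropped.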

☎️ Contact

For questions, issues, or licensing inquiries, open an issue on the model repository.
