Medical Jargon Identifier with CRF

A PyTorch model that performs fine-grained medical jargon identification using a RoBERTa-large backbone enhanced by a Conditional Random Field (CRF) layer.
Fine-tuned on the MedReadMe dataset introduced by Jiang & Xu (2024).

🧠 Overview

Architecture: RoBERTa-large → Linear classifier → CRF
Task: Token-level classification into 7 jargon categories, encoded with BIO tags
Input: Raw English text (sentences or paragraphs)
Output: Word-level spans labeled with jargon type and boundaries
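
To make the role of the final CRF stage concrete, here is a minimal pure-Python sketch of Viterbi decoding, the standard way to pick the best tag sequence from per-token scores plus tag-to-tag transition scores. It is not the code in modeling_jargon.py; the function name `viterbi_decode`, the three-tag toy set, and all scores are illustrative assumptions.

```python
from typing import List

def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at position t (e.g. classifier logits)
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    # score[j] = best score of any path ending in tag j at the current position
    score = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emission[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return path[::-1]

# Toy example with tags 0 = O, 1 = B, 2 = I (hypothetical scores).
# The large negative O -> I transition forbids an I tag right after O.
toy_transitions = [
    [0.0, 0.0, -1e4],  # from O
    [0.0, 0.0, 0.0],   # from B
    [0.0, 0.0, 0.0],   # from I
]
toy_emissions = [
    [2.0, 0.0, 0.0],   # token 1: clearly O
    [0.0, 1.0, 1.5],   # token 2: I scores highest on its own
    [0.0, 0.0, 2.0],   # token 3: clearly I
]
toy_path = viterbi_decode(toy_emissions, toy_transitions)
print(toy_path)
```

In this toy case a per-token argmax would tag the second token as I directly after O, which is not a valid BIO sequence; the transition scores steer the decoder to a B tag there instead. This is the kind of consistency a CRF layer is typically added for.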

🎯 Supported Jargon Categories

| Label (BIO) | Meaning |
|---|---|
| `medical-jargon-google-easy` | Easily Google-able medical terms |
| `medical-jargon-google-hard` | Complex, hard-to-Google medical terms |
| `medical-name-entity` | Named diseases, drugs, procedures |
| `general-complex` | Complex general vocabulary |
| `abbr-medical` | Medical abbreviations (e.g., ECG, CBC) |
| `abbr-general` | General abbreviations |
| `general-medical-multisense` | Words with both lay and medical meanings |
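
Under a BIO scheme, each category is split into a begin (`B-`) and an inside (`I-`) tag, with `O` for everything else, which gives 2 × 7 + 1 = 15 tags. The sketch below builds such a label map. The `B-`/`I-` prefix format and the id ordering are assumptions for illustration; the authoritative mapping is whatever the model's config.json defines.

```python
CATEGORIES = [
    "medical-jargon-google-easy",
    "medical-jargon-google-hard",
    "medical-name-entity",
    "general-complex",
    "abbr-medical",
    "abbr-general",
    "general-medical-multisense",
]

# Assumed layout: "O" first, then a B-/I- pair per category.
labels = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(len(labels))   # 15
print(id2label[1])   # B-medical-jargon-google-easy
```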

📁 Files & Format

pytorch_model.bin – model weights
config.json – hyper-parameters & label map
tokenizer.json, vocab.json, merges.txt – RoBERTa tokenizer assets
modeling_jargon.py – custom CRFTokenClassificationModel class
requirements.txt – runtime dependencies
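
If you only need the label map, it can be read without loading the weights. This is a minimal sketch, assuming config.json follows the usual Hugging Face convention of an `id2label` object with string keys; the helper name `load_id2label` is illustrative, and this repository's config may be laid out differently.

```python
import json
from pathlib import Path
from typing import Dict

def load_id2label(config_path: str) -> Dict[int, str]:
    """Read the id -> label mapping from a Hugging Face style config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # JSON object keys are always strings, so convert them back to ints.
    return {int(idx): label for idx, label in config["id2label"].items()}
```

For a local copy of the repository, `load_id2label("config.json")` would return a dictionary keyed by integer tag id.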

🔧 Quick Start

from transformers import AutoTokenizer
from modeling_jargon import CRFTokenClassificationModel
import torch

# 1. Load model and tokenizer
model_name = "DNivalis/med-jargon-crf"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = CRFTokenClassificationModel.from_pretrained(model_name)
model.eval()

# 2. Prepare input text
text = "The patient presented with elevated CRP and intermittent AF."
inputs = tokenizer(text, return_tensors="pt")

# 3. Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs["logits"]
    # Decode best sequence using CRF
    predicted_tags = model.decode(logits, inputs["attention_mask"])[0]

# 4. Group BIO tags into spans
#    (assumes labels look like "O", "B-<type>", "I-<type>")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.id2label[tag_id] for tag_id in predicted_tags]

spans = []  # (jargon_type, start_token, end_token_exclusive)
start, current = None, None
for i, label in enumerate(labels + ["O"]):  # sentinel closes the final span
    continues = label.startswith("I-") and label[2:] == current
    if current is not None and not continues:
        spans.append((current, start, i))
        start, current = None, None
    if label != "O" and current is None:
        start, current = i, label[2:]

# 5. Display results, one line per span
print("Detected medical jargon:")
for jargon_type, start, end in spans:
    # Convert the span's subword tokens back to text
    detected_text = tokenizer.convert_tokens_to_string(tokens[start:end])
    print(f"{jargon_type}: '{detected_text.strip()}'")

🏥 Supported Tasks

Medical jargon detection – binary, 3-class, or 7-category granularity
Named-entity recognition – extract spans of medical interest
Readability analysis – density of jargon per sentence or document
Downstream QA & summarization – filter or simplify complex terms
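
For the readability use case, one simple measure is the fraction of words in a sentence that fall inside a predicted jargon span. The sketch below assumes spans are already available as (start_word, end_word_exclusive, label) tuples over whitespace-split words; that span format and the `jargon_density` helper are hypothetical, not part of the model's API.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_word, end_word_exclusive, label)

def jargon_density(sentence: str, spans: List[Span]) -> float:
    """Fraction of whitespace-separated words covered by at least one span."""
    words = sentence.split()
    if not words:
        return 0.0
    covered = set()
    for start, end, _label in spans:
        # Clamp to the sentence so malformed spans cannot inflate the count
        covered.update(range(max(start, 0), min(end, len(words))))
    return len(covered) / len(words)

sentence = "The patient presented with elevated CRP and intermittent AF."
# Hypothetical predictions for illustration, not real model output.
example_spans = [(5, 6, "abbr-medical"), (8, 9, "abbr-medical")]
print(jargon_density(sentence, example_spans))
```

Averaging this value over the sentences of a document gives a rough document-level score; filtering `spans` by label first restricts it to particular categories.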

🌍 Language

English only.

📚 Training Data

Fine-tuned on MedReadMe: 4,520 sentences with fine-grained jargon span annotations, including the novel Google-Easy and Google-Hard categories.

📖 Citation

If you use this model or the underlying dataset, please cite:

@article{jiang2024medreadmesystematicstudyfinegrained,
  title={MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain},
  author={Chao Jiang and Wei Xu},
  year={2024},
  eprint={2405.02144},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2405.02144}
}

📝 License & Usage

Licensed under Apache 2.0.

✅ Allowed: research, commercial use, derivative works
Include license notice and attribution in any distribution

⚠️ Important Notes

Model outputs are not medical advice; use for research/educational purposes only.
Performance may vary on text that differs substantially from the MedReadMe training domain.
Consider additional post-processing for production systems (e.g., confidence filtering).
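
As one example of such post-processing, spans can be dropped when their confidence falls below a threshold. This sketch assumes each span already carries a `score` between 0 and 1 (for instance, an average of per-token probabilities that you compute yourself). The quick-start output does not include such a score, so the `ScoredSpan` type, the `filter_by_confidence` helper, and the 0.5 default are all illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoredSpan:
    text: str
    label: str
    score: float  # assumed confidence in [0, 1]

def filter_by_confidence(spans: List[ScoredSpan],
                         threshold: float = 0.5) -> List[ScoredSpan]:
    """Keep only spans whose score meets the threshold."""
    return [span for span in spans if span.score >= threshold]

# Hypothetical spans and scores for illustration only.
candidates = [
    ScoredSpan("CRP", "abbr-medical", 0.93),
    ScoredSpan("presented", "general-complex", 0.31),
    ScoredSpan("AF", "abbr-medical", 0.88),
]
kept = filter_by_confidence(candidates, threshold=0.5)
print([span.text for span in kept])
```

A suitable threshold depends on the application and is best tuned on held-out data from the target domain.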

☎️ Contact

For questions, issues, or licensing inquiries, open an issue on the model repository.
