# Medical Jargon Identifier with CRF
A PyTorch model that performs fine-grained medical jargon identification using a RoBERTa-large backbone enhanced by a Conditional Random Field (CRF) layer.
Fine-tuned on the MedReadMe dataset introduced by Jiang & Xu (2024).
## Overview

- Architecture: RoBERTa-large → Linear classifier → CRF
- Task: Token-level classification into 7 medical jargon categories with BIO tagging
- Input: Raw English text (sentences or paragraphs)
- Output: Word-level spans labeled with jargon type and boundaries
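
For intuition, here is a minimal sketch of this architecture. It is not the shipped implementation (that lives in `modeling_jargon.py`), and it assumes the `pytorch-crf` package (`torchcrf`) for the CRF layer.

```python
# Illustrative sketch only: RoBERTa-large encoder -> linear emission scores -> CRF.
# The released model is implemented in modeling_jargon.py; torchcrf is an assumption here.
import torch.nn as nn
from transformers import RobertaModel
from torchcrf import CRF

class JargonTaggerSketch(nn.Module):
    def __init__(self, num_labels: int):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("FacebookAI/roberta-large")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)  # per-token label scores
        if labels is not None:
            # Training: negative CRF log-likelihood as the loss
            return -self.crf(emissions, labels, mask=attention_mask.bool())
        # Inference: Viterbi-decode the best tag sequence per example
        return self.crf.decode(emissions, mask=attention_mask.bool())
```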
## Supported Jargon Categories

| Label (BIO) | Meaning |
|---|---|
| medical-jargon-google-easy | Easily Google-able medical terms |
| medical-jargon-google-hard | Complex, hard-to-Google medical terms |
| medical-name-entity | Named diseases, drugs, procedures |
| general-complex | Complex general vocabulary |
| abbr-medical | Medical abbreviations (e.g., ECG, CBC) |
| abbr-general | General abbreviations |
| general-medical-multisense | Words with both lay and medical meanings |
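
With BIO tagging, each category above contributes a `B-` and an `I-` tag, plus a shared `O` tag for non-jargon tokens. The authoritative label strings and id ordering are defined by the `id2label` map in `config.json`; the listing below is only an illustrative reconstruction.

```python
# Illustrative only: reconstruct a BIO tag set from the 7 categories (15 tags with "O").
# The authoritative label strings and ids come from id2label in config.json.
CATEGORIES = [
    "medical-jargon-google-easy",
    "medical-jargon-google-hard",
    "medical-name-entity",
    "general-complex",
    "abbr-medical",
    "abbr-general",
    "general-medical-multisense",
]
BIO_LABELS = ["O"] + [f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I")]
print(len(BIO_LABELS))  # 15
```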
## Files & Format

- `pytorch_model.bin` – model weights
- `config.json` – hyper-parameters & label map
- `tokenizer.json`, `vocab.json`, `merges.txt` – RoBERTa tokenizer assets
- `modeling_jargon.py` – custom `CRFTokenClassificationModel` class
- `requirements.txt` – runtime dependencies
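
Because `CRFTokenClassificationModel` is a custom class, `modeling_jargon.py` must be present locally before the Quick Start import below will work. One way to fetch it (a sketch using `huggingface_hub`, with the repo id from the Quick Start):

```python
# Download the custom modeling file from the repo so it can be imported locally.
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="DNivalis/med-jargon-crf", filename="modeling_jargon.py")
print(path)  # cached local path; copy it next to your script (or add its folder to sys.path)
```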
## Quick Start
```python
from transformers import AutoTokenizer
from modeling_jargon import CRFTokenClassificationModel
import torch

# 1. Load model and tokenizer
model_name = "DNivalis/med-jargon-crf"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = CRFTokenClassificationModel.from_pretrained(model_name)
model.eval()

# 2. Prepare input text
text = "The patient presented with elevated CRP and intermittent AF."
inputs = tokenizer(text, return_tensors="pt")

# 3. Run inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs["logits"]
    # Decode the best tag sequence with the CRF
    predicted_tags = model.decode(logits, inputs["attention_mask"])[0]

# 4. Extract spans from predictions (tag id 0 is the "O" label)
spans = [(i, model.id2label[tag_id]) for i, tag_id in enumerate(predicted_tags) if tag_id != 0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# 5. Display results
print("Detected medical jargon:")
for token_idx, label in spans:
    # Group consecutive tokens carrying the same tag into one span
    end_idx = token_idx + 1
    while (end_idx < len(predicted_tags) and
           predicted_tags[end_idx] == predicted_tags[token_idx]):
        end_idx += 1
    # Convert the sub-word tokens back to text
    detected_tokens = tokens[token_idx:end_idx]
    detected_text = tokenizer.convert_tokens_to_string(detected_tokens)
    print(f"{label}: '{detected_text.strip()}'")
```
## Supported Tasks

- Medical jargon detection – binary, 3-class, or 7-category granularity
- Named-entity recognition – extract spans of medical interest
- Readability analysis – density of jargon per sentence or document (see the sketch after this list)
- Downstream QA & summarization – filter or simplify complex terms
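
For readability analysis, one simple score is the fraction of tokens the model tags with any non-`O` label. A hedged sketch (the helper name `jargon_density` is ours, not part of the released code), reusing `tokenizer` and `model` from the Quick Start and assuming the decoded sequence includes the `<s>`/`</s>` special-token positions:

```python
# Jargon density: share of non-"O" tags among the non-special tokens of a sentence.
def jargon_density(sentence: str) -> float:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        tags = model.decode(model(**enc)["logits"], enc["attention_mask"])[0]
    content = tags[1:-1]  # drop the <s> and </s> special-token positions
    return sum(t != 0 for t in content) / max(len(content), 1)

print(jargon_density("The patient presented with elevated CRP and intermittent AF."))
```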
## Language
English only.
## Training Data

Fine-tuned on MedReadMe: 4,520 sentences with fine-grained jargon span annotations, including the novel Google-Easy and Google-Hard categories.
## Citation
If you use this model or the underlying dataset, please cite:
```bibtex
@article{jiang2024medreadmesystematicstudyfinegrained,
  title={MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain},
  author={Chao Jiang and Wei Xu},
  year={2024},
  eprint={2405.02144},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2405.02144}
}
```
## License & Usage

Licensed under Apache 2.0.

- Allowed: research, commercial use, derivative works
- Include license notice and attribution in any distribution
## Important Notes
- Model outputs are not medical advice; use for research/educational purposes only.
- Performance may vary on text that differs substantially from the MedReadMe training domain.
- Consider additional post-processing for production systems (e.g., confidence filtering).
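
For example, since the CRF returns only the best tag path (no per-token probability), one simple filter is to use the softmax of the emission logits at each predicted tag as a confidence proxy. A sketch reusing `logits`, `predicted_tags`, and `model` from the Quick Start; the 0.5 threshold is arbitrary:

```python
# Crude confidence filter: keep a predicted tag only if its emission softmax is >= 0.5.
probs = torch.softmax(logits[0], dim=-1)  # shape: (seq_len, num_labels)
confident = [
    (i, model.id2label[tag])
    for i, tag in enumerate(predicted_tags)
    if tag != 0 and probs[i, tag].item() >= 0.5
]
print(confident)
```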
## Contact
For questions, issues, or licensing inquiries, open an issue on the model repository.
Base model: FacebookAI/roberta-large