---
library_name: transformers
tags:
- biobert
- medical-nlp
- icd-9
- classification
- healthcare
license: apache-2.0
language:
- en
base_model:
- dmis-lab/biobert-v1.1
pipeline_tag: text-classification
---

# Model Card for BioBERT Fine-tuned on MIMIC-III for ICD-9 Code Classification

## Model Details

### Model Description

This is a BioBERT model fine-tuned on the MIMIC-III (Medical Information Mart for Intensive Care) corpus for ICD-9 code classification. It predicts diagnostic codes from Electronic Health Record (EHR) text and symptom descriptions.

- **Developed by:** [Researcher/Institution Name - to be added]
- **Model type:** Transformer-based medical language model (BioBERT)
- **Language(s):** English (Medical Domain)
- **License:** Apache 2.0
- **Finetuned from model:** `dmis-lab/biobert-v1.1`

### Model Sources

- **Repository:** [GitHub/Model Repository Link - to be added]
- **Paper:** [Research Paper Link - to be added]

## Uses

### Direct Use

The model predicts relevant ICD-9 diagnostic codes from clinical text such as electronic health records, medical notes, or symptom descriptions.

### Downstream Use

This model can be integrated into:

- Clinical decision support systems
- Medical coding automation
- Electronic health record (EHR) analysis tools
- Healthcare informatics research

### Out-of-Scope Use

- The model should not be used for direct medical diagnosis without professional medical oversight.
- It is not intended to replace clinical judgment.
- Performance may degrade on text outside the medical domain or text that differs significantly from the training corpus.

## Bias, Risks, and Limitations

- The model's coverage is limited to the medical conditions and coding patterns present in the MIMIC-III dataset.
- Biases in the original training data may carry over into the predictions.
- Accuracy can be affected by variations in medical terminology, writing style, and complex medical cases.

### Recommendations

- Validate model predictions with medical professionals.
- Use the model as a supportive tool, not a replacement for expert medical assessment.
- Regularly evaluate performance on new datasets.
- Be aware of potential demographic or contextual biases in the predictions.

## How to Get Started with the Model

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load the model and tokenizer ('model_path' is a placeholder for the checkpoint location)
model = AutoModelForSequenceClassification.from_pretrained('model_path')
tokenizer = AutoTokenizer.from_pretrained('model_path')
model.eval()  # disable dropout for inference


def predict_icd9_codes(input_text, threshold=0.8):
    """Return the ICD-9 codes whose predicted probability exceeds `threshold`."""
    # Tokenize the input, truncating to the 512-token limit
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512, padding='max_length')

    # Run the model without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Multi-label head: apply a sigmoid to each logit independently
    probabilities = torch.sigmoid(outputs.logits)[0]

    # Convert tensor indices to plain ints before looking them up in id2label
    label_ids = (probabilities > threshold).nonzero(as_tuple=True)[0].tolist()
    return [model.config.id2label[i] for i in label_ids]
```
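
To see what the thresholding step does without loading a checkpoint, here is a minimal pure-Python sketch. The logits and the `toy_id2label` mapping are made-up illustration values, not outputs of this model, and `codes_above_threshold` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def codes_above_threshold(logits, id2label, threshold=0.8):
    """Return the labels whose sigmoid probability exceeds `threshold`."""
    return [id2label[i] for i, logit in enumerate(logits) if sigmoid(logit) > threshold]


# Hypothetical values for illustration only (not real model output)
toy_id2label = {0: "401.9", 1: "428.0", 2: "250.00"}
toy_logits = [2.5, -1.0, 1.0]

print(codes_above_threshold(toy_logits, toy_id2label))                 # ['401.9']
print(codes_above_threshold(toy_logits, toy_id2label, threshold=0.5))  # ['401.9', '250.00']
```

Lowering the threshold returns more codes, which typically raises recall at the cost of precision, so the default of 0.8 is worth revisiting for each deployment.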

## Training Details

### Training Data

- **Dataset:** MIMIC-III corpus
- **Domain:** Medical/Clinical text
- **Content:** Electronic Health Records (EHR)

### Training Procedure

#### Preprocessing

- Text tokenization
- Maximum sequence length: 512 tokens
- Padding to uniform length
- Potential text normalization techniques
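
The truncation and padding steps listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the preprocessing code used for training: `truncate_and_pad` is a hypothetical helper, the token IDs are toy values, and `pad_id=0` is assumed. A real tokenizer also keeps special tokens such as `[CLS]` and `[SEP]` in place when truncating, which this sketch ignores.

```python
def truncate_and_pad(token_ids, max_length=512, pad_id=0):
    """Clip a token-id list to `max_length`, then right-pad it with `pad_id`.

    Returns the fixed-length ids and an attention mask (1 = real token, 0 = padding).
    """
    clipped = token_ids[:max_length]
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return input_ids, attention_mask


# Toy example with a short max_length so the result is easy to read
ids, mask = truncate_and_pad([101, 7592, 2088, 102], max_length=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

Note that any text beyond the 512-token limit is discarded, so long clinical notes lose their later content under this scheme.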

#### Training Hyperparameters

- **Base Model:** BioBERT (`dmis-lab/biobert-v1.1`)
- **Training Regime:** Fine-tuning
- **Precision:** [Specify training precision, e.g., mixed precision]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- Held-out subset of the MIMIC-III corpus
- Diverse medical cases and documentation styles

#### Metrics

- Precision
- Recall
- F1-Score
- Multi-label classification metrics
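
For multi-label output, precision, recall, and F1 are commonly computed by pooling counts across all documents (micro-averaging). The card does not say which averaging scheme was used, so the following is only a sketch of one option; `micro_prf` is a hypothetical helper and the code sets are invented for illustration.

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over per-document label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)  # codes predicted and correct
        fp += len(pred - truth)  # codes predicted but not in the gold set
        fn += len(truth - pred)  # gold codes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted codes for two notes
gold = [{"401.9", "428.0"}, {"250.00"}]
pred = [{"401.9"}, {"250.00", "428.0"}]
print(micro_prf(gold, pred))
```

Here two of the three predicted codes are correct and two of the three gold codes are recovered, so precision, recall, and F1 all come out to 2/3. Macro-averaging, which averages per-code scores instead, would weight rare codes more heavily.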

## Environmental Impact

- Estimated carbon emissions to be calculated
- Compute details to be specified

## Technical Specifications

### Model Architecture

- **Base Model:** BioBERT (`dmis-lab/biobert-v1.1`)
- **Task:** Multi-label ICD-9 Code Classification
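
In a multi-label setup each ICD-9 code is treated as an independent yes/no decision, which is why inference applies a sigmoid per label instead of a softmax across labels. The card does not state the training loss. Binary cross-entropy over the logits is the usual choice for such heads, so the sketch below assumes it; `bce_with_logits` is a hypothetical pure-Python helper and the numbers are invented.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Hypothetical logits for three codes; the note carries the first and third code
print(bce_with_logits([2.5, -1.0, 1.0], [1.0, 0.0, 1.0]))
```

Because each label contributes its own term, a note can carry any number of codes, and a confident wrong prediction on one code is penalized regardless of the others.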

## Citation

[Citation information to be added when research is published]

## More Information

For more details about the model's development, performance, and usage, please contact the model developers.