---
library_name: transformers
license: mit
datasets:
- aai530-group6/ddxplus
language:
- en
metrics:
- precision
- recall
- f1
base_model:
- cambridgeltl/SapBERT-from-PubMedBERT-fulltext
tags:
- medical-diagnosis
- sapbert
- ddxplus
- pubmedbert
- disease-classification
- differential-diagnosis
---


## Model Details

### Model Description

This model is a fine-tuned version of cambridgeltl/SapBERT-from-PubMedBERT-fulltext on the DDXPlus dataset (10,000 samples) for medical diagnosis tasks.

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

- **Developed by:** [Aashish Acharya](https://github.com/acharya-jyu)
- **Model type:** sapBERT-BioMedBERT
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** cambridgeltl/SapBERT-from-PubMedBERT-fulltext

### Model Sources 

- **Repository:** [cambridgeltl/SapBERT-from-PubMedBERT-fulltext](https://huggingface.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext)
- **Dataset:** [aai530-group6/ddxplus](https://huggingface.co/aai530-group6/ddxplus)

## Training Dataset
The model was trained on a 10,000-sample subset of the DDXPlus dataset containing:
- Patient cases with comprehensive medical information
- Differential diagnosis annotations
- 49 distinct medical conditions
- Evidence-based symptom-condition relationships
## Performance
### Final Metrics
- Test Precision: 0.9619
- Test Recall: 0.9610
- Test F1 Score: 0.9592
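
The metrics above are standard multi-class precision, recall, and F1. As a minimal sketch (assuming weighted averaging, which the card does not state explicitly), they can be computed with scikit-learn; the labels below are dummy values for illustration only:

```python
from sklearn.metrics import precision_recall_fscore_support

# Dummy predictions for illustration; the reported numbers above come
# from the actual DDXPlus test split
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 1, 2, 0, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```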
### Training Evolution
- Best Validation F1: 0.9728 (Epoch 4)
- Final Validation Loss: 0.6352

<img src="https://cdn-uploads.huggingface.co/production/uploads/662757230601587f0be9781b/7GK4e9jy4vKz9gSXU-dbh.png" width="400" alt="image">
<img src="https://cdn-uploads.huggingface.co/production/uploads/662757230601587f0be9781b/5b_O5oX0BISljP1kwdtTN.png" width="400" alt="image">


## Intended Use
This model is designed for:
- Medical diagnosis support
- Symptom analysis
- Disease classification
- Differential diagnosis generation

## Out-of-Scope Use
The model should NOT be used for:

- Direct medical diagnosis without professional oversight
- Critical healthcare decisions without human validation
- Clinical applications without proper testing and validation


## Training Details
### Training Procedure

- Optimizer: AdamW with weight decay (0.01)
- Learning Rate: 1e-5
- Loss Function: Combined loss (0.8 × Focal Loss + 0.2 × KL Divergence)
- Batch Size: 32
- Gradient Clipping: 1.0
- Early Stopping: Patience of 3 epochs
- Training Strategy: Cross-validation with 5 folds
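
The combined loss above can be sketched in PyTorch as follows. This is an illustrative implementation, not the exact training code: the focal-loss `gamma` and the source of the soft targets for the KL term are assumptions not stated in this card.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss over hard labels; gamma=2.0 is an assumed value."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)  # probability assigned to the true class
    return ((1.0 - pt) ** gamma * ce).mean()

def combined_loss(logits, targets, soft_targets):
    """0.8 x focal loss (hard labels) + 0.2 x KL divergence (soft labels)."""
    kl = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets,
                  reduction="batchmean")
    return 0.8 * focal_loss(logits, targets) + 0.2 * kl
```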


### Model Architecture

- Base Model: cambridgeltl/SapBERT-from-PubMedBERT-fulltext
- Hidden Size: 768
- Attention Heads: 12
- Dropout Rate: 0.5
- Added classification layers for diagnostic tasks
- Layer normalization and dropout for regularization
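
A hypothetical head matching the description above (layer normalization and dropout over the 768-dimensional encoder output, projecting to the 49 DDXPlus conditions) might look like this; the exact layer arrangement in the released checkpoint may differ:

```python
import torch
import torch.nn as nn

class DiagnosisHead(nn.Module):
    """Illustrative classification head: LayerNorm + Dropout(0.5)
    over a 768-dim pooled encoder output, projected to 49 conditions."""
    def __init__(self, hidden_size=768, num_labels=49, dropout=0.5):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        return self.classifier(self.dropout(self.norm(pooled_output)))
```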

## Example Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "acharya-jyu/sapbert-pubmedbert-ddxplus-10k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example patient case: demographics plus standardized evidence codes
input_data = {
    'age': 45,                    # Patient age
    'sex': 'M',                   # Patient sex: 'M' or 'F'
    'initial_evidence': 'E_91',   # Initial evidence code (e.g., E_91 for fever)
    'evidences': ['E_91', 'E_77', 'E_89']  # Fever, cough, fatigue
}

# Serialize the case into a single text sequence for the encoder
text = (f"age: {input_data['age']} sex: {input_data['sex']} "
        f"initial evidence: {input_data['initial_evidence']} "
        f"evidences: {' '.join(input_data['evidences'])}")

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state holds the contextual embeddings; the fine-tuned
# classification head maps them to the main diagnosis prediction,
# differential diagnosis probabilities, and confidence scores.
```
**Note:** Evidence codes (E_XX) correspond to specific symptoms and conditions defined in the release_evidences.json file. The model expects these standardized codes rather than raw text input.
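
Resolving evidence codes to readable text is a simple dictionary lookup. The snippet below uses an inline excerpt-style mapping for illustration; the field names and values are assumptions, and the real release_evidences.json shipped with DDXPlus contains the full mapping:

```python
import json

# Illustrative excerpt only; field names assumed for demonstration
evidence_meta = json.loads("""
{
  "E_91": {"question_en": "Do you have a fever?"},
  "E_77": {"question_en": "Do you have a cough?"}
}
""")

def code_to_text(code, meta):
    """Resolve an evidence code to readable text, falling back to the code."""
    entry = meta.get(code)
    return entry["question_en"] if entry else code

print(code_to_text("E_91", evidence_meta))  # Do you have a fever?
print(code_to_text("E_89", evidence_meta))  # E_89 (not in this excerpt)
```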

## Citation
```bibtex
@misc{acharya2024sapbert,
  title={SapBERT-PubMedBERT Fine-tuned on DDXPlus Dataset},
  author={Acharya, Aashish},
  year={2024},
  publisher={Hugging Face Model Hub}
}
```

## Model Card Contact
[Aashish Acharya](https://github.com/acharya-jyu)