|
--- |
|
library_name: transformers |
|
tags: |
|
- medical |
|
- vision-language |
|
- clip |
|
- multimodal
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for ConceptCLIP |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
**ConceptCLIP** is a large-scale vision-language pre-training model enhanced with medical concepts for diverse medical image modalities. It enables robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment. |
|
|
|
- **Developed by:** Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Hao Chen |
|
- **Model type:** Vision-Language Pre-trained Model (Medical Specialized) |
|
- **Language(s):** English (text); images span multiple medical imaging modalities
|
- **License:** MIT |
|
- **Finetuned from model:** SigLIP-ViT-400M-16 (image encoder) and PubMedBERT (text encoder), trained with the [OpenCLIP](https://github.com/mlfoundations/open_clip) codebase
|
|
|
### Model Sources |
|
|
|
- **Repository:** [GitHub Project](https://github.com/JerrryNie/ConceptCLIP) |
|
- **Paper:** [ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Language-Image Pre-training](https://arxiv.org/abs/2501.15579)
|
- **Demo:** [Hugging Face Model Hub](https://huggingface.co/JerrryNie/ConceptCLIP) |
|
|
|
## Uses |
|
|
|
|
|
|
### Direct Use |
|
|
|
- Zero-shot medical image classification |
|
- Cross-modal retrieval (see the retrieval sketch after this list)
|
- Zero-shot concept annotation |
|
- Feature extraction for whole-slide image analysis

- Feature extraction for medical report generation
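
The same feature interface supports cross-modal retrieval. Below is a minimal sketch that ranks a few candidate report sentences against a query image; the image path and candidate texts are placeholders, and it assumes the output keys (`image_features`, `text_features`) shown in the quick-start example later in this card.

```python
# Hypothetical retrieval sketch: rank candidate texts against a query image.
# The image path and candidate reports below are placeholders.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

query_image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
candidate_reports = [
    'No acute cardiopulmonary abnormality.',
    'Large left-sided pleural effusion.',
    'Diffuse bilateral ground-glass opacities.',
]

inputs = processor(
    images=query_image,
    text=candidate_reports,
    return_tensors='pt',
    padding=True,
    truncation=True,
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Cosine similarities between the image and each report, best match first
# (features are assumed L2-normalized, as in the quick-start example)
sims = (outputs['image_features'] @ outputs['text_features'].t())[0]
for score, report in sorted(zip(sims.tolist(), candidate_reports), reverse=True):
    print(f'{score:.3f}  {report}')
```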
|
|
|
### Downstream Use |
|
|
|
- Fine-tuning on specific medical imaging modalities (e.g., CT, MRI, X-ray) for classification and visual question answering (see the linear-probe sketch after this list)
|
- Building concept bottleneck models for explainable predictions
|
- Integration into clinical decision support systems |
|
- Medical education and training tools |
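
As a concrete starting point for fine-tuning, a common lightweight recipe is a linear probe: freeze the pretrained encoder and train only a small classification head on its image features. The sketch below is illustrative, not from the paper; `train_loader`, `num_classes`, and `feature_dim` are placeholders, and it assumes the remote code accepts image-only inputs.

```python
# Illustrative linear-probe sketch (not from the paper): freeze ConceptCLIP and
# train a lightweight classifier on its image features. Assumes an image-only
# forward pass is supported by the remote code; adapt to the model's actual API.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
model.eval()                         # encoder stays frozen; only the probe is trained

num_classes = 3                      # placeholder label set size
feature_dim = 512                    # placeholder; read from the model config in practice
probe = nn.Linear(feature_dim, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, labels in train_loader:  # placeholder DataLoader of (PIL images, LongTensor labels)
    inputs = processor(images=images, return_tensors='pt').to(model.device)
    with torch.no_grad():
        feats = model(**inputs)['image_features']  # frozen features, no gradients
    loss = criterion(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```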
|
|
|
### Out-of-Scope Use |
|
|
|
- Direct clinical diagnosis without expert validation
|
- Non-medical image analysis |
|
- General-purpose vision tasks outside the medical domain
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- Trained primarily on medical imaging data, which may contain demographic biases
|
- Performance may vary across different medical imaging modalities |
|
- Should not be used as a sole diagnostic tool without human oversight
|
|
|
### Recommendations |
|
|
|
- Validate outputs with clinical experts before medical decision-making
|
- Fine-tune on domain-specific data for specialized applications |
|
- Conduct bias analysis when deploying in new clinical environments |
|
|
|
## How to Get Started with the Model |
|
|
|
```python |
|
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# trust_remote_code is required: ConceptCLIP ships custom modeling code on the Hub
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Zero-shot classification: score one image against a set of label prompts
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Scale the image-text similarities and normalize into per-label probabilities
probs = (outputs['logit_scale'] * outputs['image_features'] @ outputs['text_features'].t()).softmax(dim=-1)[0]

print({label: f"{prob:.2%}" for label, prob in zip(labels, probs)})
|
``` |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- Large-scale medical image-text pairs with concept information |
|
|
|
### Training Procedure |
|
|
|
- Built on OpenCLIP architecture with medical concept integration |
|
- Pre-training with image-text alignment (IT-Align) and patch-concept alignment (PC-Align) objectives (a generic IT-Align sketch follows)
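
For intuition, IT-Align follows the standard contrastive language-image recipe. The sketch below is a generic symmetric InfoNCE formulation, not the paper's exact objective (which builds on OpenCLIP/SigLIP and adds the PC-Align term):

```python
# Generic InfoNCE-style image-text alignment loss, shown for intuition only;
# the paper's exact IT-Align/PC-Align objectives may differ in detail.
import torch
import torch.nn.functional as F

def it_align_loss(image_features, text_features, logit_scale):
    # Both feature matrices are assumed L2-normalized with shape (batch, dim)
    logits = logit_scale * image_features @ text_features.t()
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match each image to its paired text and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```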
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
- Base architecture: SigLIP-ViT-400M-16 + PubMedBERT |
|
- Training regime: Mixed precision training |
|
- Batch size: 12,288 (without PC-Align), 6,144 (with PC-Align)

- Learning rate: 5e-4 (without PC-Align), 3e-4 (with PC-Align)
|
|
|
|
|
## Evaluation |
|
|
|
### Testing Data & Metrics |
|
|
|
#### Testing Data |
|
|
|
- Evaluated on multiple open-source medical imaging benchmarks spanning medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
|
@article{nie2025conceptclip, |
|
title={ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Language-Image Pre-training}, |
|
author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Chen, Hao}, |
|
journal={arXiv preprint arXiv:2501.15579}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
**APA:** |
|
|
|
Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., & Chen, H. (2025). ConceptCLIP: Towards trustworthy medical AI via concept-enhanced contrastive language-image pre-training. arXiv preprint arXiv:2501.15579. |
|
|
|
## Model Card Contact |
|
|
|
Yuxiang Nie: [email protected] |