---
library_name: transformers
tags:
  - medical
  - vision-language
  - clip
  - various modalities
license: mit
language:
  - en
---

# Model Card for ConceptCLIP

## Model Details

### Model Description

ConceptCLIP is a large-scale vision-language pre-trained model enhanced with medical concepts, covering diverse medical image modalities. It delivers robust performance across multiple medical imaging tasks through concept-enhanced language-image alignment.

- **Developed by:** Yuxiang Nie, Sunan He, Yequan Bie, Yihui Wang, Zhixuan Chen, Shu Yang, Hao Chen
- **Model type:** Vision-language pre-trained model (medical-specialized)
- **Language(s):** English (text); multi-modal (medical imaging)
- **License:** MIT
- **Finetuned from model:** Based on OpenCLIP

### Model Sources

- **Repository:** https://huggingface.co/JerrryNie/ConceptCLIP
- **Paper:** [arXiv:2501.15579](https://arxiv.org/abs/2501.15579)

## Uses

### Direct Use

- Zero-shot medical image classification
- Cross-modal image-text retrieval (see the retrieval sketch after this list)
- Zero-shot concept annotation
- Feature extraction for whole-slide image analysis
- Feature extraction for medical report generation
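
As an illustration of the cross-modal retrieval item above, the sketch below ranks candidate text descriptions against a query image by cosine similarity. It assumes the output dict exposes L2-normalized `image_features` and `text_features` (the same fields used in the quick-start example further below); the file path and candidate texts are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Placeholder query image and candidate descriptions to rank
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
candidates = [
    'frontal chest radiograph with clear lung fields',
    'axial brain MRI showing the ventricles',
    'dermoscopic image of a pigmented skin lesion',
]

inputs = processor(
    images=image,
    text=candidates,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)
    # Cosine similarity between the image and every candidate text
    sims = (outputs['image_features'] @ outputs['text_features'].t())[0]

# Rank candidates from most to least similar to the query image
for idx in sims.argsort(descending=True):
    print(f'{sims[idx]:.3f}  {candidates[idx]}')
```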

### Downstream Use

- Fine-tuning for specific medical imaging tasks (e.g., CT, MRI, X-ray analysis) such as classification and visual question answering (a linear-probe sketch follows this list)
- Concept bottleneck models for explainable predictions
- Integration into clinical decision support systems
- Medical education and training tools
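
As a concrete example of the fine-tuning item above, a common lightweight recipe is a linear probe: freeze ConceptCLIP and train a single linear layer on its image embeddings. This is a minimal sketch, not the paper's recipe; the features, labels, embedding dimension, and hyperparameters below are all hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical precomputed ConceptCLIP image embeddings and class labels
train_features = torch.randn(512, 768)      # [num_samples, embed_dim] (dim assumed)
train_labels = torch.randint(0, 3, (512,))  # toy 3-class task

# Frozen backbone, trainable linear head
probe = nn.Linear(train_features.shape[1], 3)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    optimizer.zero_grad()
    loss = criterion(probe(train_features), train_labels)
    loss.backward()
    optimizer.step()
```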

### Out-of-Scope Use

- Direct clinical diagnosis without clinical validation
- Non-medical image analysis
- General-purpose vision tasks outside the medical domain

## Bias, Risks, and Limitations

- Trained primarily on medical imaging data, which may contain demographic biases
- Performance may vary across medical imaging modalities
- Should not be used as a sole diagnostic tool without human oversight

### Recommendations

- Validate outputs with clinical experts before medical decision-making
- Fine-tune on domain-specific data for specialized applications
- Conduct bias analysis when deploying in new clinical environments

## How to Get Started with the Model

The snippet below runs zero-shot classification: it embeds one image and several candidate label prompts, then turns their similarities into probabilities.

```python
from transformers import AutoModel, AutoProcessor
import torch
from PIL import Image

# Load the model and processor (custom code from the repo is required)
model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Build one text prompt per candidate label
image = Image.open('example_data/chest_X-ray.jpg').convert('RGB')
labels = ['chest X-ray', 'brain MRI', 'skin lesion']
texts = [f'a medical image of {label}' for label in labels]

inputs = processor(
    images=image,
    text=texts,
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

with torch.no_grad():
    outputs = model(**inputs)
    # Scale image-text similarities and normalize them into probabilities
    logits = (outputs['logit_scale'] * outputs['image_features'] @ outputs['text_features'].t()).softmax(dim=-1)[0]

print({label: f"{prob:.2%}" for label, prob in zip(labels, logits)})
```
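
For the feature-extraction uses listed under Direct Use (whole-slide image analysis, report generation), the same forward pass can supply image embeddings alone. A minimal sketch under stated assumptions: the patch file names are placeholders, and a dummy text prompt is passed only because the joint forward shown above expects both modalities.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('JerrryNie/ConceptCLIP', trust_remote_code=True)

# Placeholder tiles, e.g. patches cropped from a whole-slide image
patches = [Image.open(p).convert('RGB') for p in ['patch_0.png', 'patch_1.png']]

inputs = processor(
    images=patches,
    text=['a medical image'],  # dummy prompt; only the image features are used
    return_tensors='pt',
    padding=True,
    truncation=True
).to(model.device)

with torch.no_grad():
    feats = model(**inputs)['image_features']  # [num_patches, embed_dim]

print(feats.shape)
```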

## Training Details

### Training Data

- Large-scale medical image-text pairs enriched with concept information

### Training Procedure

- Built on the OpenCLIP architecture with medical concept integration
- Pre-training with image-text alignment (IT-Align) and patch-concept alignment (PC-Align) objectives (see the loss sketch after this list)
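
For intuition only, here is a generic CLIP-style symmetric contrastive loss of the kind image-text alignment objectives build on. It is not a reproduction of IT-Align or PC-Align as defined in the paper (which, given the SigLIP base, may instead use a sigmoid-style loss).

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = logit_scale * image_features @ text_features.t()  # [B, B]
    targets = torch.arange(logits.shape[0], device=logits.device)
    # Each image should match its paired text, and vice versa
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```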

#### Training Hyperparameters

- Base architecture: SigLIP-ViT-400M-16 (vision) + PubMedBERT (text)
- Training regime: Mixed-precision training
- Batch size: 12,288 without PC-Align; 6,144 with PC-Align
- Learning rate: 5e-4 without PC-Align; 3e-4 with PC-Align

## Evaluation

### Testing Data & Metrics

#### Testing Data

- Evaluated on multiple open-source medical imaging benchmarks covering medical image diagnosis, cross-modal retrieval, medical visual question answering, medical report generation, whole-slide image analysis, and explainable AI

## Citation

**BibTeX:**

```bibtex
@article{nie2025conceptclip,
  title={ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Language-Image Pre-training},
  author={Nie, Yuxiang and He, Sunan and Bie, Yequan and Wang, Yihui and Chen, Zhixuan and Yang, Shu and Chen, Hao},
  journal={arXiv preprint arXiv:2501.15579},
  year={2025}
}
```

**APA:**

Nie, Y., He, S., Bie, Y., Wang, Y., Chen, Z., Yang, S., & Chen, H. (2025). ConceptCLIP: Towards trustworthy medical AI via concept-enhanced contrastive language-image pre-training. arXiv preprint arXiv:2501.15579.

## Model Card Contact

Yuxiang Nie: [email protected]