CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation
CheX-Phi3.5V is a vision-language model (VLM) that answers clinical questions about chest radiographs and generates structured findings reports.
Built on Phi-3.5 Vision-Instruct (4.2 B parameters), it introduces Direct Preference Optimization (DPO) and contrastive rejection learning to achieve fine-grained medical reasoning while suppressing hallucinations.
Key Features
| Aspect | Description |
|---|---|
| Modality | Single-image chest radiography (frontal & lateral) |
| Tasks | Visual Question Answering (VQA) & findings generation |
| Backbone | Phi-3.5 Vision (4.2 B) with an enhanced visual projection layer (sketched below) |
| Optimisation | Two-stage: SFT → DPO + contrastive rejection learning |
| License | Apache 2.0 (free for research and commercial use) |
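The enhanced visual projection layer is not described in detail here. As a minimal sketch of what such a projector typically looks like, the module below maps vision-encoder patch features into the language model's embedding space; the two-layer GELU MLP and all dimensions are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Hypothetical vision-to-language projector (illustrative only)."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 3072):
        super().__init__()
        # Two-layer MLP: a common design for bridging vision and language towers.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(patch_features)
```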
Quick Start
```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO"
# trust_remote_code follows the base Phi-3.5-vision checkpoints, which ship custom processing code
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example_frontal.jpg")  # the processor expects a PIL image, not a file path
inputs = processor(
    images=image,
    text="Question: What abnormalities are present?\nAnswer:",
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0], skip_special_tokens=True))
```
Dependencies
```bash
pip install "transformers>=4.41.0" timm accelerate
```
For batch inference or a Streamlit demo, see the scripts in the GitHub repo.
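As a minimal sketch (reusing `model` and `processor` from the Quick Start, with a placeholder image directory), batch inference can be approximated by looping over files; the repository scripts remain the reference implementation.

```python
from pathlib import Path
from PIL import Image

QUESTION = "Question: What abnormalities are present?\nAnswer:"

for path in sorted(Path("cxr_images").glob("*.jpg")):  # placeholder directory
    image = Image.open(path)
    inputs = processor(images=image, text=QUESTION, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=128)
    answer = processor.decode(generated[0], skip_special_tokens=True)
    print(f"{path.name}: {answer}")
```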
Available Checkpoints
| HF Repo | Stage | Recommended Use |
|---|---|---|
| CheX-Phi3.5-vision-instruct-DPO | DPO | Production / evaluation |
| CheX-Phi3.5-vision-instruct-SFT | SFT (phase 2) | Further preference tuning |
| Phi-3.5-vision-instruct | Base | Custom fine-tuning |
Training Data & Procedure
| Stage | Data & Size | Objective |
|---|---|---|
| SFT | 120 k QA triplets (MIMIC-EXT VQA) | Supervised instruction tuning |
| DPO | 30 k preference-paired QA | Direct Preference Optimization |
| Contrastive | 250 k unlabelled MIMIC-CXR images | Rejection learning to curb hallucinations |
Hardware: 8 × A100 80 GB • FP16 • DeepSpeed ZeRO-3. Total steps ≈ 2.2 M.
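For orientation, the DPO stage optimises the standard preference objective: given log-probabilities of a preferred and a rejected answer under the policy and a frozen reference model, it pushes the policy's log-ratio above the reference's. The sketch below is the textbook formulation, not the project's training code; `beta` and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Textbook DPO: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```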
Evaluation
| Dataset | Split | Metric | Score |
|---|---|---|---|
| MIMIC-CXR VQA | test | Accuracy | 0.894 |
| OpenI CXR-QA | test | BLEU-4 | 79.4 |
| Radiologist Turing Test | 200 cases | Pass rate | 61 % |
Evaluation scripts are provided in `stage3_evaluate_mimic_ext_vqa.sh`.
Ethical & Safety Considerations
- Clinical usage: outputs are assistive only; a certified radiologist must confirm all findings.
- Bias: training data are skewed towards North-American adult populations; paediatric or non-Western cohorts may underperform.
- Privacy: MIMIC-CXR is fully de-identified; the model does not memorise PHI.
- Hallucinations: contrastive rejection reduces but does not eliminate false positives; apply confidence thresholds (see the sketch after this list).
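One way to implement such a threshold (a sketch, not part of the released pipeline) is to score each generation by its mean token log-probability via `transformers`' `compute_transition_scores` and abstain below a cutoff tuned on validation data; `THRESHOLD` and the reuse of `inputs` from the Quick Start are assumptions.

```python
# Reuse `model`, `processor`, and `inputs` from the Quick Start.
out = model.generate(
    **inputs,
    max_new_tokens=128,
    return_dict_in_generate=True,
    output_scores=True,
)

# Per-token log-probabilities of the generated answer.
token_logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
mean_logprob = token_logprobs.mean().item()

THRESHOLD = -1.0  # illustrative cutoff; tune on a validation set
if mean_logprob < THRESHOLD:
    print("Low-confidence answer; defer to a radiologist.")
else:
    print(processor.decode(out.sequences[0], skip_special_tokens=True))
```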
Known Limitations
- No generalisation to CT, MRI, or ultrasound modalities.
- Sensitive to extreme noise & portable AP projections.
- Knowledge cutoff = Mar 2023; newly described conditions may be missed.
Resources
- Code & training scripts: https://github.com/remove4anonymous/CheX-Phi35V
- Data utilities: `tools/generate_visual_prompt.py`
- Demo: `demo.py`
Citation
```bibtex
@misc{liu2025chexphi35v,
  title        = {CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation},
  author       = {Liu, Xiao and others},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO}}
}
```
If you use CheX-Phi3.5V, please cite us and consider sharing your downstream results!