CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation

CheX-Phi3.5V is a vision-language model (VLM) that answers clinical questions about chest radiographs and generates structured findings reports.
Built on Phi-3.5-vision-instruct (4.2B parameters), it adds Direct Preference Optimization (DPO) and contrastive rejection learning to sharpen fine-grained medical reasoning while suppressing hallucinations.


Key Features

| Aspect | Description |
| --- | --- |
| Modality | Single-image chest radiography (frontal and lateral) |
| Tasks | Visual question answering (VQA) and findings generation |
| Backbone | Phi-3.5-vision-instruct (4.2B) with an enhanced visual projection layer |
| Optimisation | Two-stage: SFT → DPO + contrastive rejection learning |
| License | Apache 2.0 (free for research and commercial use) |

Quick Start

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

model_id = "remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO"

# Phi-3.5-vision ships custom modelling code, hence trust_remote_code=True
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The processor expects a PIL image, not a file path
image = Image.open("example_frontal.jpg")
inputs = processor(
    images=image,
    text="Question: What abnormalities are present?\nAnswer:",
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0], skip_special_tokens=True))

Dependencies: pip install "transformers>=4.41.0" timm accelerate
For batch inference or a Streamlit demo, see the scripts in the GitHub repo; a minimal loop is sketched below.
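
As a rough starting point before reaching for the repo scripts, here is a minimal sketch of a batch loop over a folder of radiographs. The folder layout, file extension, and helper name are illustrative assumptions, not part of the released code:

from pathlib import Path

from PIL import Image

def answer_for_folder(folder, question, model, processor, max_new_tokens=128):
    """Hypothetical helper: run one question against every .jpg radiograph in a folder."""
    answers = {}
    for path in sorted(Path(folder).glob("*.jpg")):
        image = Image.open(path)
        inputs = processor(
            images=image,
            text=f"Question: {question}\nAnswer:",
            return_tensors="pt",
        ).to(model.device)
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
        answers[path.name] = processor.decode(generated[0], skip_special_tokens=True)
    return answers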


Available Checkpoints

| HF Repo | Stage | Recommended Use |
| --- | --- | --- |
| CheX-Phi3.5-vision-instruct-DPO | DPO | Production / evaluation |
| CheX-Phi3.5-vision-instruct-SFT | SFT (phase 2) | Further preference tuning |
| Phi-3.5-vision-instruct | Base | Custom fine-tuning |

Training Data & Procedure

| Stage | Data & Size | Objective |
| --- | --- | --- |
| SFT | 120k QA triplets (MIMIC-EXT VQA) | Supervised instruction tuning |
| DPO | 30k preference-paired QA examples | Direct Preference Optimization |
| Contrastive | 250k unlabelled MIMIC-CXR images | Rejection learning to curb hallucinations |
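
For readers unfamiliar with DPO, the stage-2 objective scores a preferred answer against a rejected one under both the policy and a frozen reference model. A minimal PyTorch sketch of the per-pair loss, assuming the summed per-token log-probabilities are already computed (beta = 0.1 is an illustrative value, not the repo's setting):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    # How much more the policy prefers each answer than the frozen reference does
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the preferred answer's log-ratio above the rejected answer's
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()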

Hardware: 8 × A100 80 GB GPUs, FP16, DeepSpeed ZeRO-3. Total training steps ≈ 2.2M.
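
For orientation, a ZeRO-3 configuration of the kind the hardware notes imply might look as follows; the batch-size values are illustrative assumptions, not the published training settings:

# Illustrative DeepSpeed ZeRO-3 config matching the FP16 / ZeRO-3 setup above
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 4,   # placeholder value
    "gradient_accumulation_steps": 8,      # placeholder value
}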


Evaluation

| Dataset | Split | Metric | Score |
| --- | --- | --- | --- |
| MIMIC-CXR VQA | test | Accuracy | 0.894 |
| OpenI CXR-QA | test | BLEU-4 | 79.4 |
| Radiologist Turing test | 200 cases | Pass rate | 61% |

Evaluation scripts are provided in stage3_evaluate_mimic_ext_vqa.sh.
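
To sanity-check results outside that script, the two headline metrics can be approximated as below (exact-match accuracy plus corpus BLEU-4 via sacrebleu; the authors' exact tokenisation may differ, so scores are indicative only):

import sacrebleu

def exact_match_accuracy(predictions, references):
    # Case- and whitespace-insensitive exact match over closed-ended answers
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)

def corpus_bleu4(predictions, references):
    # sacrebleu defaults to 4-gram corpus BLEU on a 0-100 scale
    return sacrebleu.corpus_bleu(predictions, [references]).score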


Ethical & Safety Considerations

  • Clinical usage β€” Outputs are assistive only; a certified radiologist must confirm findings.
  • Bias β€” Training data skewed towards North-American adult populations; paediatric or non-western cohorts may underperform.
  • Privacy β€” MIMIC-CXR is fully de-identified; the model does not memorise PHI.
  • Hallucinations β€” Contrastive rejection reduces but does not eliminate false positives; use confidence thresholds.

Known Limitations

  1. Does not generalise to CT, MRI, or ultrasound.
  2. Sensitive to severe image noise and to portable AP projections.
  3. Knowledge cutoff is March 2023; conditions described after that date may be missed.

Citation

@misc{liu2025chexphi35v,
  title        = {CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation},
  author       = {Liu, Xiao and others},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO}}
}

If you use CheX-Phi3.5V, please cite us and consider sharing your downstream results!
