CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation
CheX-Phi3.5V is a vision-language model (VLM) that answers clinical questions about chest radiographs and generates structured findings reports.
Built on Phi-3.5 Vision-Instruct (4.2 B parameters), it introduces Direct Preference Optimization (DPO) and contrastive rejection learning to achieve fine-grained medical reasoning while suppressing hallucinations.
Key Features
| Aspect | Description |
|---|---|
| Modality | Single-image chest radiography (frontal & lateral) |
| Tasks | Visual Question Answering (VQA) & findings generation |
| Backbone | Phi-3.5 Vision (4.2 B) with an enhanced visual projection layer (sketched below) |
| Optimisation | Two-stage: SFT → DPO + contrastive rejection learning |
| License | Apache 2.0 (free for research and commercial use) |
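The enhanced visual projection layer is not described in detail here. As a minimal sketch of what such a projector typically looks like, the module below maps vision-encoder patch features into the language model's embedding space; the two-layer GELU MLP and all dimensions are illustrative assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Hypothetical vision-to-language projector (illustrative only)."""

    def __init__(self, vision_dim: int = 1024, lm_dim: int = 3072):
        super().__init__()
        # Two-layer MLP: a common design for bridging vision and language towers.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(patch_features)
```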
Quick Start
```python
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO"
# trust_remote_code follows the base Phi-3.5-vision checkpoints, which ship custom processing code
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example_frontal.jpg")  # the processor expects a PIL image, not a file path
inputs = processor(
    images=image,
    text="Question: What abnormalities are present?\nAnswer:",
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(generated[0], skip_special_tokens=True))
```
Dependencies
```bash
pip install "transformers>=4.41.0" timm accelerate
```
For batch inference or a Streamlit demo, see the scripts in the GitHub repo.
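As a minimal sketch (reusing `model` and `processor` from the Quick Start, with a placeholder image directory), batch inference can be approximated by looping over files; the repository scripts remain the reference implementation.

```python
from pathlib import Path
from PIL import Image

QUESTION = "Question: What abnormalities are present?\nAnswer:"

for path in sorted(Path("cxr_images").glob("*.jpg")):  # placeholder directory
    image = Image.open(path)
    inputs = processor(images=image, text=QUESTION, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=128)
    answer = processor.decode(generated[0], skip_special_tokens=True)
    print(f"{path.name}: {answer}")
```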
Available Checkpoints
| HF Repo | Stage | Recommended Use |
|---|---|---|
| CheX-Phi3.5-vision-instruct-DPO | DPO | Production / evaluation |
| CheX-Phi3.5-vision-instruct-SFT | SFT (phase 2) | Further preference tuning |
| Phi-3.5-vision-instruct | Base | Custom fine-tuning |
Training Data & Procedure
| Stage | Data & Size | Objective |
|---|---|---|
| SFT | 120 k QA triplets (MIMIC-EXT VQA) | Supervised instruction tuning |
| DPO | 30 k preference-paired QA | Direct Preference Optimization |
| Contrastive | 250 k unlabelled MIMIC-CXR images | Rejection learning to curb hallucinations |
Hardware: 8 × A100 80 GB • FP16 • DeepSpeed ZeRO-3. Total steps ≈ 2.2 M.
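For orientation, the DPO stage optimises the standard preference objective: given log-probabilities of a preferred and a rejected answer under the policy and a frozen reference model, it pushes the policy's log-ratio above the reference's. The sketch below is the textbook formulation, not the project's training code; `beta` and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Textbook DPO: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```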
Evaluation
| Dataset | Split | Metric | Score |
|---|---|---|---|
| MIMIC-CXR VQA | test | Accuracy | 0.894 |
| OpenI CXR-QA | test | BLEU-4 | 79.4 |
| Radiologist Turing Test | 200 cases | Pass rate | 61 % |
Evaluation scripts are provided in `stage3_evaluate_mimic_ext_vqa.sh`.
Ethical & Safety Considerations
- Clinical usage: outputs are assistive only; a certified radiologist must confirm all findings.
- Bias: training data are skewed towards North-American adult populations; paediatric or non-Western cohorts may underperform.
- Privacy: MIMIC-CXR is fully de-identified; the model does not memorise PHI.
- Hallucinations: contrastive rejection reduces but does not eliminate false positives; apply confidence thresholds (see the sketch after this list).
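One way to implement such a threshold (a sketch, not part of the released pipeline) is to score each generation by its mean token log-probability via `transformers`' `compute_transition_scores` and abstain below a cutoff tuned on validation data; `THRESHOLD` and the reuse of `inputs` from the Quick Start are assumptions.

```python
# Reuse `model`, `processor`, and `inputs` from the Quick Start.
out = model.generate(
    **inputs,
    max_new_tokens=128,
    return_dict_in_generate=True,
    output_scores=True,
)

# Per-token log-probabilities of the generated answer.
token_logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
mean_logprob = token_logprobs.mean().item()

THRESHOLD = -1.0  # illustrative cutoff; tune on a validation set
if mean_logprob < THRESHOLD:
    print("Low-confidence answer; defer to a radiologist.")
else:
    print(processor.decode(out.sequences[0], skip_special_tokens=True))
```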
Known Limitations
- No generalisation to CT, MRI, or ultrasound modalities.
- Sensitive to extreme noise & portable AP projections.
- Knowledge cutoff = Mar 2023; newly described conditions may be missed.
Resources
- Code & training scripts: https://github.com/remove4anonymous/CheX-Phi35V
- Data utilities: `tools/generate_visual_prompt.py`
- Demo: `demo.py`
Citation
```bibtex
@misc{liu2025chexphi35v,
  title        = {CheX-Phi3.5V: Preference-Optimised Vision-Language Model for Chest X-ray Interpretation},
  author       = {Liu, Xiao and others},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/remove4anonymous/CheX-Phi-3.5-vision-instruct-DPO}}
}
```
If you use CheX-Phi3.5V, please cite us and consider sharing your downstream results!