MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine
Abstract
The MedBLINK benchmark evaluates the perceptual abilities of multimodal language models in clinical image interpretation, revealing significant gaps compared to human performance.
Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce MedBLINK, a benchmark designed to probe these models for such perceptual abilities. MedBLINK spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general-purpose (GPT-4o, Claude 3.5 Sonnet) and domain-specific (Med-Flamingo, LLaVA-Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.
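To make the evaluation protocol concrete, below is a minimal sketch of a MedBLINK-style multiple-choice accuracy evaluation. The dataset fields (`image`, `question`, `options`, `answer`) and the `query_model` helper are illustrative assumptions, not the authors' released data format or API.

```python
# Minimal sketch of a multiple-choice perception evaluation in the spirit of MedBLINK.
# Field names and query_model() are hypothetical placeholders, not the released API.
import random


def query_model(image_path: str, question: str, options: list[str]) -> str:
    """Placeholder for an MLM call (e.g., sending the image plus a multiple-choice
    prompt to a model API). Here it simply guesses so the sketch stays runnable."""
    return random.choice(options)


def evaluate(examples: list[dict]) -> float:
    """Accuracy over multiple-choice questions, mirroring the paper's metric."""
    correct = 0
    for ex in examples:
        prediction = query_model(ex["image"], ex["question"], ex["options"])
        correct += int(prediction == ex["answer"])
    return correct / len(examples)


# Toy example reflecting the kind of perceptual check the benchmark probes
# (e.g., contrast enhancement, image orientation).
examples = [
    {
        "image": "ct_slice_001.png",
        "question": "Is this CT scan contrast-enhanced?",
        "options": ["Yes", "No"],
        "answer": "No",
    }
]
print(f"Accuracy: {evaluate(examples):.1%}")
```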
Community
Would you trust ChatGPT with your X-ray if it couldn't tell if the image is upside down?
We introduce MedBLINK, a benchmark that evaluates MLMs on basic perception tasks that are trivial for clinicians but that current models often fail.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images (2025)
- MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports (2025)
- SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning (2025)
- 3D-RAD: A Comprehensive 3D Radiology Med-VQA Dataset with Multi-Temporal Analysis and Diverse Diagnostic Tasks (2025)
- A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering (2025)
- CAPO: Reinforcing Consistent Reasoning in Medical Decision-Making (2025)
- How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study (2025)