---
title: Medical LLM Leaderboard
emoji: 🌎
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: apache-2.0
tags:
- leaderboard
short_description: A Benchmark of Large Language Models in the Clinic
---


We benchmark 22 LLMs in the clinic across 11 tasks, 7 metrics, 17 datasets, and over 20,000 test samples.
Our results show that LLMs are poor clinical decision-makers on multiple complex clinical tasks.

GitHub: https://github.com/AI-in-Health/ClinicBench/

Paper: https://aclanthology.org/2024.emnlp-main.759.pdf

If our repository is helpful to your work, please consider citing 📑 our paper. Thank you!

```bibtex
@inproceedings{Liu2024ClinicBench,
  title={Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark},
  author={Fenglin Liu and Zheng Li and Hongjian Zhou and Qingyu Yin and Jingfeng Yang and Xianfeng Tang and Chen Luo and Ming Zeng and Haoming Jiang and Yifan Gao and Priyanka Nigam and Sreyashi Nag and Bing Yin and Yining Hua and Xuan Zhou and Omid Rohanian and Anshul Thakur and Lei Clifton and David A. Clifton},
  booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2024}
}
```