---
title: Medical LLM Leaderboard
emoji: 🌎
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: apache-2.0
tags:
- leaderboard
short_description: A Benchmark of Large Language Models in the Clinic
---


We benchmark 22 LLMs in clinical settings across 11 tasks, 7 metrics, 17 datasets, and over 20,000 test samples.
The results show that LLMs are poor clinical decision-makers on multiple complex clinical tasks.

GitHub: https://github.com/AI-in-Health/ClinicBench/
Paper: https://aclanthology.org/2024.emnlp-main.759.pdf

If our repository is helpful to your work, please consider citing 📑 our paper. Thank you!

```bibtex
@inproceedings{Liu2024ClinicBench,
  title={Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark},
  author={Fenglin Liu and Zheng Li and Hongjian Zhou and Qingyu Yin and Jingfeng Yang and Xianfeng Tang and Chen Luo and Ming Zeng and Haoming Jiang and Yifan Gao and Priyanka Nigam and Sreyashi Nag and Bing Yin and Yining Hua and Xuan Zhou and Omid Rohanian and Anshul Thakur and Lei Clifton and David A. Clifton},
  booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2024}
}
```