---
title: Medical LLM Leaderboard
emoji: 🌎
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: apache-2.0
tags:
- leaderboard
short_description: A Benchmark of Large Language Models in the Clinic
---
We benchmark 22 LLMs in the clinic across 11 tasks, 7 metrics, 17 datasets, and over 20,000 test samples.
Our results show that LLMs are poor clinical decision-makers on multiple complex clinical tasks.
GitHub: https://github.com/AI-in-Health/ClinicBench/
Paper: https://aclanthology.org/2024.emnlp-main.759.pdf
If this repository is helpful to your work, please consider citing our paper 📑. Thank you!
```BibTex
@inproceedings{Liu2024ClinicBench,
title={Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark},
author={Fenglin Liu and Zheng Li and Hongjian Zhou and Qingyu Yin and Jingfeng Yang and Xianfeng Tang and Chen Luo and Ming Zeng and Haoming Jiang and Yifan Gao and Priyanka Nigam and Sreyashi Nag and Bing Yin and Yining Hua and Xuan Zhou and Omid Rohanian and Anshul Thakur and Lei Clifton and David A. Clifton},
booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2024}
}
```