Amit Kumar committed on
Commit c44f168 · 1 Parent(s): 0f6aea4

added accuracy and corrected the lists

Files changed (1)
  1. about/description.md +8 -8
about/description.md CHANGED
@@ -1,5 +1,5 @@
 <h2 style="color: #00ff00;">Goal:</h2>
- The goal of Classification Medical LLM Leaderboard is to track, rank and evaluate the performance of large language models (LLMs) on medical classification tasks. It evaluates LLMs across a diverse array of medical datasets, starting with radiological reports dataset as follows:
+ The goal of the Classification Medical NLP Leaderboard is to track, rank, and evaluate the performance of large language models (LLMs) on medical classification tasks. It evaluates LLMs across a diverse array of medical datasets, starting with a radiological reports dataset, as follows:
 
 | S.No | Dataset Name | About the Dataset | Type of Classification | Link to the Dataset |
 |------|---------|---|-------------------------|---------|
@@ -9,14 +9,14 @@ The goal of Classification Medical LLM Leaderboard is to track, rank and evaluat
 The leaderboard offers a comprehensive assessment of each model's classification aspects.
 
 <h2 style="color: #00ff00;">Evaluation Criteria:</h2>
- 1. Accuracy: The primary metric used for evaluation is accuracy, which measures the proportion of correct predictions made by the model.
-
- <h2 style="color: #00ff00;">Different Parameters:</h2>
-
- The leaderboard displays the different type of settings explored to get various results
- 1. <b> Different shots prompting </b>: 0 shot, 1 shot, 5 shots.
- 2. <b> Different roles </b> : Some models like llama allows to assign different content to different roles like system or user in this case.
- 3. <b> Chain of thought prompting: </b> A kind of prompting technique where, a complex task is broken down into simple steps. It is proven to be better than simple prompting. Refer [COT prompting](https://www.promptingguide.ai/techniques/cot)
+ The primary metric used for evaluation is accuracy, which measures the proportion of correct predictions made by the model. We use two levels of accuracy; a short computational sketch follows the list. <br>
+ 1. <b>Label-level accuracy</b>: accuracy is measured over all individual labels across all reports. <br>
+ 2. <b>Record-level accuracy</b>: a report counts as correct only if it is classified accurately across all of its labels.
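
A minimal sketch of how these two accuracy levels could be computed, assuming multi-label predictions stored as binary matrices; the arrays, shapes, and names below are illustrative, not the leaderboard's actual evaluation code:

```python
import numpy as np

# Hypothetical (n_reports, n_labels) binary matrices; stand-ins only.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])

# Label-level accuracy: fraction of individual label predictions that match.
label_level = (y_true == y_pred).mean()

# Record-level accuracy: fraction of reports whose labels ALL match.
record_level = (y_true == y_pred).all(axis=1).mean()

print(f"label-level:  {label_level:.3f}")   # 8 of 9 labels correct
print(f"record-level: {record_level:.3f}")  # 2 of 3 reports fully correct
```

Record-level accuracy is the stricter of the two: a single wrong label makes the whole report count as incorrect, so it is always less than or equal to label-level accuracy.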
+
+ <h2 style="color: #00ff00;">Different Parameters:</h2> The leaderboard displays the different types of settings explored to get the various results; a prompt-construction sketch follows the list. <br>
+ 1. <b>Different shots prompting</b>: 0-shot, 1-shot, 5-shot. <br>
+ 2. <b>Different roles</b>: some models, such as Llama, allow assigning different content to different roles, in this case system or user. <br>
+ 3. <b>Chain of thought prompting</b>: a prompting technique in which a complex task is broken down into simple steps; it has been shown to outperform simple prompting. Refer to [COT prompting](https://www.promptingguide.ai/techniques/cot). <br>
 4. <b>Active Prompt</b>: in progress; it will be used along with different shots prompting.
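
As a rough illustration of settings 1-3 above, here is a minimal sketch of how a few-shot, role-based prompt with an optional chain-of-thought instruction might be assembled. The `messages` structure follows the widely used chat-completions convention; the instruction wording, example reports, and labels are hypothetical placeholders, not the leaderboard's actual prompts:

```python
# Hypothetical few-shot pool: (report text, gold label) pairs.
FEWSHOT_EXAMPLES = [
    ("Chest X-ray shows clear lung fields.", "Normal"),
    ("CT reveals a 2 cm nodule in the right upper lobe.", "Abnormal"),
]

def build_messages(report: str, n_shots: int = 1, chain_of_thought: bool = False):
    """Assemble a chat-style prompt: a system role for the task
    instruction, user/assistant turns for the shots, then the query."""
    instruction = "Classify the radiology report as Normal or Abnormal."
    if chain_of_thought:
        # Chain of thought: ask the model to reason step by step first.
        instruction += " Think step by step, then give the final label."
    messages = [{"role": "system", "content": instruction}]
    for text, label in FEWSHOT_EXAMPLES[:n_shots]:  # 0-shot skips this loop
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": report})
    return messages

# 1-shot with chain-of-thought enabled:
print(build_messages("MRI shows no acute abnormality.", n_shots=1, chain_of_thought=True))
```

Varying `n_shots` (0, 1, 5) and toggling `chain_of_thought` reproduces the kinds of setting combinations the leaderboard reports; models without a system role would fold the instruction into the first user turn instead.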
 
 <h2 style="color: #00ff00;">Submit your model or dataset:</h2>